Thursday, August 8, 2024

PROBABILITY AND STATISTICS CHAPTER-1

 Basic Concept of Statistics:

The word “Statistics” has been derived from the Latin word “Status”, the French word “Statistique”, the Italian word “Statista” and the German word “Statistik”, each of which means 'political state'. States originally used Statistics only to keep information for government purposes, such as the population, the poverty or wealth of the country, the number of police and soldiers, fiscal policies etc. Statistics was therefore regarded as the science of statecraft and was used mainly for the collection of information.

The word “Statistics” is used in the singular as well as the plural sense:

(i) Quantitative information of facts, or simply data.

(ii) Statistical methods for handling numerical data.

The first is the plural sense: Statistics means the quantitative information, numerical set of data or numerical facts collected systematically. The second is the singular sense: Statistics means the statistical methods and techniques adopted for the collection, presentation, analysis and interpretation of numerical data.

Definition of Statistics

There is not a single definition of Statistics since the field of applicability has been widening day by day. As the field of applicability is extended, the definition of Statistics needs to be modified. Hence, it has been defined in a wider domain. Some comprehensive definitions are as follows.

Definition of Statistics in Singular sense

In the singular sense, it is also known as 'statistical method’ for the collection, presentation, analysis and interpretation of numerical data.

The most comprehensive definition given by Croxton and Cowden is:

"Statistics may be defined as the science of the collection, presentation, analysis and interpretation of numerical data."

Definition of Statistics in Plural sense

In the plural sense, of all definitions Prof. Horace Secrist’s definition is the most comprehensive. According to him:

"Statistics are aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standard of accuracy, collected in a systematic manner for a predetermined purpose and placed in relation to each other."

 

Statistics may also be classified into two parts, which are as follows.

1.      Theoretical Statistics or Mathematical Statistics

2.      Applied Statistics

Theoretical Statistics or mathematical statistics

         Theoretical Statistics can further be subdivided into two parts.

        Descriptive Statistics

        Inferential Statistics

 

Descriptive Statistics

Descriptive Statistics merely describes the data. It consists of the methods and techniques used in the collection, organization and presentation of data using tables, charts, diagrams, graphs etc., in summarizing data using measures of central tendency, dispersion, skewness, kurtosis etc., and in analyzing data in order to describe their various features and characteristics. The summarized results obtained from descriptive statistics describe the data at hand but cannot be used to generalize beyond them. Averages, rates, ratios, percentages etc. are the basic summary statistics for summarizing data, depending on their nature. Hence, the most frequently used descriptive statistical measures are measures of central tendency, dispersion, skewness and kurtosis.
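As a minimal sketch, the summary measures named above can be computed with Python's standard statistics module. The data set is hypothetical and chosen only for illustration; it does not come from these notes.

```python
import statistics

# Hypothetical sample: weights (in kg) of ten students, invented for illustration.
weights = [48, 52, 54, 55, 55, 63, 66, 68, 71, 85]

summary = {
    "mean": statistics.mean(weights),        # measure of central tendency
    "median": statistics.median(weights),    # middle value of the ordered data
    "mode": statistics.mode(weights),        # most frequent value
    "range": max(weights) - min(weights),    # simplest measure of dispersion
    "std_dev": statistics.stdev(weights),    # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

These figures describe only the ten values at hand; saying anything about a wider population is the job of inferential statistics.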

 

Inferential Statistics

Inferential Statistics deals with the methods of drawing (or inferring) conclusions about the characteristics of the population based upon the results of a sample taken from the population.

In other words, the branch of Statistics that deals with selecting a sample from a population, and with the techniques used to draw conclusions about the population from statistical measures obtained from the sample, is called inferential statistics.

It is used in the estimation of parameters and the statistical testing of hypotheses.

 

Applied Statistics

This consists of the wide application of theoretical or mathematical Statistics in different areas such as biology, physics, engineering, astronomy, metrology, chemistry, medical science, sociology, psychology, business, economics, agriculture, econometrics, bio-statistics or biometry etc. Statistical tools and methods are used to solve many practical problems in diverse areas. Besides that, applied statistics is used in decision-making problems.

 

Functions of Statistics

Statistics plays a vital role in almost every field of study. Some of its important functions are listed below:

   Statistics helps in the classification of data.

   Statistics simplifies complexity.

   Statistics facilitates comparison.

   Statistics presents facts in a definite form.

   Statistics helps to determine the relationship between different phenomena.

   Statistics helps in the formulation of policies.

   Statistics helps in predicting or forecasting future trends.

   Statistics helps in formulating and testing hypotheses.

   Statistics helps to draw valid inferences or conclusions.

 

 Importance and Scope of Statistics

In recent years, developments in statistical studies have considerably increased its scope and importance, so there is hardly any walk of life that has not been affected by statistics. It has become an indispensable part of almost all areas. It is used in the physical sciences, biological sciences, medical sciences, social sciences, management sciences, economics, industry, information technology, engineering, agriculture, insurance, business, accounting, finance, marketing, production, computing and many other areas.

 

Application Of Statistics In Engineering

Statistics is a critical tool for robustness analysis, measurement system error analysis, test data analysis, probabilistic risk assessment, and many other fields in the engineering world. The key to success in teaching statistics to engineering students is to make the statistical solution relevant to the engineering problems they face.

 Probability and statistics are used throughout engineering. In electrical engineering, signals and noise are analyzed by means of probability theory. Civil, mechanical, and industrial engineers use statistics and probability to test and account for variations in materials and goods. Chemical engineers use probability and statistics to assess experimental data and control and improve chemical processes. It is essential for today’s engineer to master these tools.

     Variable 

In Statistics, a variable or characteristic may be defined as an attribute that describes a person, place, thing or idea under study. A characteristic or measurement that may vary from one biological entity to another, from place to place or from time to time is called a variable. In other words, a variable is a quantitative or qualitative characteristic that varies from observation to observation in the same group. For example, the variable may be height, weight, age, blood pressure, pulse rate, blood sugar, temperature, gender, ethnicity, religion, occupation, hair colour, eye colour, knowledge, nationality, pain etc. Generally, variables are denoted by X, Y or Z.

 

Types of variable

The statistical data can be divided into two broad categories.  So, there are two types of variables

1.      Categorical variable or Qualitative variable

2.      Numerical variable or Quantitative variable

 

Qualitative variable

A qualitative variable is a variable or characteristic which cannot be measured in quantitative form (i.e. as a number) but can only be identified by name or category. It is also called a categorical variable. Categorical variables take a value that is one of several possible categories and, as naturally measured, have no numerical meaning. The data are classified by counting the individuals having the same characteristic or attribute, not by measurement. Examples: hair color (black, grey, white), gender (male/female), religion, nationality, marital status, disease (present/absent), eye color (black, brown, blue), vaccine (vaccinated/not vaccinated), smoking (smoking/not smoking), stage of cancer (I, II, III or IV), blood group (A, B, AB, O), degree of pain (minimal, moderate, severe or unbearable) etc.

 

A qualitative variable that has only two categories is called a binary or dichotomous variable, such as male/female, married/single, alive/dead etc. Data of this type, in which the categories have no natural order, are called nominal data. Ordered categorical data, such as pain (mild/moderate/severe), cigarette consumption (non-smoker/light smoker/heavy smoker) and socioeconomic status (lower, middle, upper), are called ordinal data.

 

Qualitative variable may be further classified into two categories

(i)               Nominal variable

(ii)             Ordinal variable

 

Nominal variable

Qualitative variables which can be divided into categories such that the numbers or symbols 0, 1, 2 etc. assigned to them have no numerical meaning are called nominal variables. The codes show that the categories differ, but the differences between the codes themselves have no meaning. Nominal variables do not support ordinary arithmetic operations such as addition, subtraction, multiplication, division etc.

For example, gender (male/female), marital status (married/single), profession (teacher, doctor, engineer, officer etc.), religion (Hindu, Muslim, Buddhist etc.), hair color (black, grey, white) etc.

 

Ordinal variable: Qualitative variables whose categories can be ordered or ranked are called ordinal variables. For example, pain: mild/moderate/severe; cigarette consumption: non-smoker/light smoker/heavy smoker; socioeconomic status: lower, middle, upper; attitude towards a certain fact: positive, neutral, negative etc.

Quantitative variable: A quantitative variable is one that can be measured and expressed numerically. A quantitative variable therefore describes a characteristic in terms of a numerical value, which is expressed in units of measurement. Examples are height, weight, age, income, pulse rate, blood pressure, level of hemoglobin in blood, number of children in a family, temperature records etc. These data may be represented on the interval and ratio scales. Quantitative variables can be divided into two types based on the nature of the characteristic. They are:

1.      Discrete variable

2.      Continuous variable

 

Discrete variable: A variable is said to be discrete if it takes only separate, countable values (typically whole numbers). It is a variable whose values are obtained by counting.

For example,

No. of goals scored in a football match

No. of children in a family: 1, 2, 3, 4, 6

WBC count: 2000, 3010, 4060, 5050

Continuous variable: A variable is said to be continuous if it can take any real value (i.e. whole numbers as well as fractional values) within a certain range. It is a variable whose values are obtained by measuring. Examples: height, weight, age, income, pulse rate, blood pressure, level of hemoglobin in blood, cholesterol etc.

For example,

Weight of children (in kg): 8, 9, 9.5, 9.8, 10, 10.3

Height of patients (in feet): 4.6, 4.9, 5, 5.4, 5.6, 5.8, 6

 

Why it is important to know what type of variable is being measured

Categorical (qualitative) variables are generally analyzed using non-parametric tests (those that do not assume a normal distribution). Examples of non-parametric tests are the Chi-square test, Fisher’s exact test, Wilcoxon matched-pairs test (signed rank test), Mann-Whitney test (U test), Kruskal-Wallis test (H test), Sign test, McNemar test, Median test etc.

Quantitative variables (more specifically, continuous variables) are generally analyzed using parametric tests (those that assume a normal distribution). Examples of parametric tests are the Z test, Student’s t test (or t test) and F test (or variance ratio test).

Measurement and measurement scales

Measurement may be defined as the assignment of numbers to objects or events according to certain rules. Measurement of statistical data is essential for further statistical analysis, and the choice of an appropriate statistical technique depends upon the type of data. Data always fall on one of four scales of measurement: nominal, ordinal, interval or ratio. Qualitative data are measured on the nominal and ordinal scales, whereas quantitative data are measured on the interval and ratio scales. The four types of measurement scale are as follows.

1.      Nominal scale

2.      Ordinal scale

3.      Interval scale

4.      Ratio scale

Nominal scale

This is the simplest and lowest measurement scale. As its name suggests, it consists of assigning numbers or symbols to events just to classify them into categories, so that the numbers assigned have no numerical meaning. The numbers show that the categories differ, but the differences between the numbers themselves have no meaning. Nominal data do not support ordinary arithmetic operations such as addition, subtraction, multiplication, division etc.

For example, 

Sex: male, female

Religion: Hindu, Buddhist, Muslim, Christian. 

Occupation: Teacher, Manager, Doctor, Businessman, Civil service

All categories are mutually exclusive. One cannot rank them: it cannot be said that male is a higher value than female or that teacher is a higher value than manager. An important feature is that the different categories are in no sense better or worse than one another; they are simply different. Symbols or numeric values such as 0, 1 or 2 are used just to identify the categories and have no other meaning.

Statistical measurement: The descriptive statistics associated with nominal data are frequencies, percentages, proportions or probabilities. Frequency tables, bar charts and pie charts are used for presentation. For central tendency only the mode can be used: nominal data cannot be averaged, only the number in each category can be counted. Measures of dispersion cannot be used, i.e. the mean and standard deviation are not applicable. For inferential statistics, the Chi-square test is the most common test applicable to nominal data; the McNemar test is also used.
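The frequency table, percentages and mode described above can be sketched in a few lines of Python. The blood-group observations below are hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical blood-group observations, invented for illustration.
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]

counts = Counter(blood_groups)   # frequency of each category
total = len(blood_groups)

print("Category  Frequency  Percentage")
for category, freq in counts.most_common():
    print(f"{category:<9} {freq:<10} {100 * freq / total:.1f}%")

# The mode is the only measure of central tendency that applies to nominal data.
mode_category, mode_freq = counts.most_common(1)[0]
print("Mode:", mode_category)
```

Note that nothing here adds or averages the categories; only counting is used, which is all the nominal scale permits.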

Ordinal scale 

The ordinal scale is slightly more sophisticated than the nominal scale. Observations are not only different from category to category; when they can also be ranked or ordered according to some criterion, priority or importance, they are said to be measured on an ordinal scale. In this scale the categories are ordered but the differences cannot be determined, i.e. the distance between objects or ranks cannot be measured. Ordinal scales use numbers, letters or any other symbols to rank items. Simple arithmetic operations are not possible for ordinal data, but ordinal data can lie on a continuum such as excellent, good, fair and poor. Examples: symptoms of depression from a psychiatric assessment, with none coded as ‘0’, mild ‘1’, moderate ‘2’, severe ‘3’; socioeconomic status: upper, middle, lower; pain: mild, moderate, severe; stage of cancer: 1, 2, 3, 4; preference among different banks such as NMB, Standard Chartered and NABIL; Likert scale items.

Statistical measurement: For ordinal data, frequency tables, bar charts and pie charts are used. The median is an appropriate measure of central tendency; the mode, frequencies and percentages can also be used. Percentiles, quartiles, the range, the maximum and the minimum are used for measuring dispersion; the standard deviation and variance are not appropriate because distances between ranks are not defined. In most cases rank-order correlation, i.e. Spearman’s rank correlation coefficient, can be used. Because of the nature of the data, only non-parametric tests can be used, such as the Wilcoxon signed rank test, Mann-Whitney U test and Kruskal-Wallis test.
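As a small illustration of the median as the measure of central tendency for ordinal data, the sketch below uses the depression coding mentioned earlier (none = 0, mild = 1, moderate = 2, severe = 3). The nine responses are hypothetical.

```python
import statistics

# Ordered categories; the integer codes carry rank information only.
LEVELS = ["none", "mild", "moderate", "severe"]
code_of = {label: rank for rank, label in enumerate(LEVELS)}

# Hypothetical assessments of nine patients, invented for illustration.
responses = ["mild", "severe", "moderate", "mild", "none",
             "moderate", "moderate", "severe", "mild"]

codes = sorted(code_of[r] for r in responses)

# median_low always returns an observed value, so it maps back to a category.
median_code = statistics.median_low(codes)
print("Median level:", LEVELS[median_code])   # prints "Median level: moderate"
```

Using median_low rather than median is a design choice: with an even number of observations it avoids averaging two codes, an operation that has no meaning on an ordinal scale.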

Interval scale

The interval scale is more sophisticated than the nominal and ordinal scales: not only can the measurements be ordered, but the distance between any two measurements is also known. In this scale, numerals with quantitative meaning are assigned to the objects.

This scale assumes that the data have equal intervals. The distances (or differences) between two values, observations, objects or ranks can be measured, but there is no inherent starting point: there is no absolute zero, and the zero is fixed only by convention. The basic arithmetic operations of addition and subtraction can be used, but the ratio of scores is not meaningful. Examples are IQ and the temperature of a body in degrees Celsius (°C) or Fahrenheit (°F). The difference between 20 °C (68 °F) and 25 °C (77 °F) is the same as the difference between 5 °C (41 °F) and 10 °C (50 °F). But the scale is not a ratio scale: it shows equal intervals, but one cannot say that 10 degrees is twice as hot as 5 degrees, because there is no absolute zero.

Statistical measurement: In analyzing interval data, mean can be appropriately used to measure central tendency, standard deviation is widely used for dispersion, for correlation product moment correlation (i.e. Karl Pearson’s correlation coefficient ) can be calculated and for inferential statistics   Z test,  t test and F test can be used for significance testing.  

Ratio scale

This is the highest level of measurement and is used for quantitative data. Ratio-scale data are often continuous. The scale includes all the properties of the interval scale as well as an absolute zero value: the numbers on a ratio scale are true numbers with a true zero. The ratio between two values on this scale is a meaningful measure, so operations such as multiplication and division are meaningful. Temperature measured in degrees Celsius or Fahrenheit is not a ratio variable because 0 °C does not mean there is no temperature. However, temperature measured in Kelvin is a ratio variable, as 0 Kelvin (often called absolute zero) is a true zero point. Other examples of ratio variables include height, mass, distance, weight etc.

 The name "ratio" reflects the fact that you can use the ratio of measurements. So, for example, a distance of ten meters is twice the distance of 5 meters. Various types of statistical analysis and mathematical operation can be used on ratio data. 

Statistical measurement: Geometric mean, harmonic mean, mean, standard deviation, coefficient of variation, Karl Pearson’s correlation coefficient, regression etc. and most parametric statistical tests such as the t-test, Z-test and F-test can be used.
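The contrast between the interval and ratio scales can be checked numerically with the temperature example. This is a minimal sketch; the helper names c_to_f and c_to_k are introduced here for illustration only.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius: float) -> float:
    """Convert degrees Celsius to Kelvin."""
    return celsius + 273.15

low_c, high_c = 5.0, 10.0

# Differences are meaningful on an interval scale: equal gaps in Celsius
# stay equal after converting to Fahrenheit (both gaps are 9 degrees F).
print(c_to_f(25) - c_to_f(20), c_to_f(10) - c_to_f(5))

# Ratios are not: the "ratio" changes with the arbitrary zero of the scale.
print("Celsius ratio:   ", high_c / low_c)
print("Fahrenheit ratio:", c_to_f(high_c) / c_to_f(low_c))

# Kelvin has a true zero, so this ratio is the physically meaningful one.
print("Kelvin ratio:    ", c_to_k(high_c) / c_to_k(low_c))
```

The Celsius ratio is 2, the Fahrenheit ratio is 50/41 and the Kelvin ratio is about 1.018, so 10 °C is not "twice as hot" as 5 °C in any unit-independent sense.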

 

Data collection

By the definition of Statistics, data are aggregates of facts that can be expressed numerically, and data are one of the main sources of information. The first step of a statistical study is the collection of data, i.e. the process of getting the necessary information from the units under investigation.

There are mainly two types of data on the basis of collection procedures. They are as follows

1.                Primary data

2.                Secondary data

Primary data

Primary data are fresh, original data collected and recorded by the investigator or researcher. In other words, data that are collected by an investigator, researcher or agent for the first time for the purpose of a statistical enquiry (or investigation) are known as primary data. They are first-hand and original in character and are generally regarded as reliable and accurate. Primary data are collected for the specific purpose of the investigator's or researcher's study.

 

Methods of collecting primary data 

There are various methods of collecting primary data.

1.      Interview method

(a)    Direct personal interview method 

(b)   Indirect personal  interview method 

(c)    Telephone interview method

2.      Information through correspondence

3.      Mailed questionnaire method 

4.      Schedule sent through enumerators

5.      Internet

6.      Observation method.

 

Problems involved in collecting primary data

There are various problems that have to be faced while collecting primary data. The nature of problems depends upon the different situations. Some of the problems are as follows.

        Generally there is a lack of time and money while collecting primary data.

        Transportation to and from the research field is essential for enumerators or investigators. This is one of the major problems for underdeveloped countries like Nepal.

        There is a high degree of non-response error from illiterate respondents.

 

Secondary Data

Secondary data are data which have already been collected and analyzed by some earlier agency for its own use and are later used by a different agency. The data are primary for the persons or institutions that collect them, but the same data become secondary for others. In effect, secondary data are borrowed from others who collected them for some other purpose. This type of data is not original in nature, and its degree of accuracy is comparatively less than that of primary data.

 

Sources of Secondary Data 

In most of the studies the investigator finds it impracticable to collect first-hand information on all related issues and as such he makes use of the data collected by others. In order to collect secondary data, the following sources may be used. 

1       Published sources, and

2       Unpublished sources

Published Sources

The various sources of published data are:

        Reports and publications of ministries and departments of the government.

        Reports and publications of reputed international organizations such as UNDP, ADB, UNESCO, WHO, World Bank etc.

        Reports and publications of reliable NGOs, NDHS (Nepal Demographic Health Survey), journals, periodicals etc.

        Publications brought out by research agencies, research scholars etc.

Unpublished Sources

Not all statistical material is published. There are various sources of unpublished data, such as:

        Records maintained by various government and private offices.

        Records maintained by research institutions, research scholars etc.

        Records updated by departments and institutions for their internal purposes.

Merits of secondary data

        It saves time and cost.

        If specially trained persons collected it, the quality of secondary data is better.

        The scope of inquiry can be increased in terms of the area and time period to be covered.

        Much of the available secondary data has been collected over many years, so it can be used to plot trends.

        It helps the government in making decisions and planning future policy.

Demerits of secondary data

        Data may not be in the exact form required by the researcher.

        Some information is often omitted or some categories are pooled.

 

Difference between Primary and Secondary data

The difference between primary and secondary data depends basically on the mode of collection. Data that are primary for one agency are treated as secondary by another, and vice versa. However, the main differences between primary and secondary data are as follows:

| Primary data | Secondary data |
|---|---|
| Primary data are original in the sense that they are personally collected by the investigator or researcher. | Secondary data are not original in the sense that they are collected by someone other than the investigator or researcher. |
| Primary data collection is more expensive and exhaustive. | Secondary data are readily available at less expense. |
| Primary data are collected as per the requirement of the investigator. | Secondary data might have been collected with different objectives. |
| Primary data may be influenced by the personal prejudice of the investigator. | Secondary data may not be influenced by the personal prejudice of the investigator. |

 

 

Presentation and Classification of Data: Stem-and-Leaf Display

Classification 

Data collected from primary sources are in a raw form: a voluminous, unorganized mass of figures that is not fit for any statistical treatment (i.e. analysis and interpretation). These unwieldy, ungrouped and shapeless masses of collected data are not easy to handle and are not capable of interpretation. So, after collecting and editing the data, the next step is to condense, organize and simplify them in such a way that irrelevant details are eliminated and their significant features stand out prominently.

Therefore, the appropriate procedure for this is the classification and presentation of data.

Classification of data:

Classification is the process of arranging the primary data into different groups or classes according to their similarities. Classification of data helps to change the voluminous heterogeneous data into homogenous groups so that its important characteristics can easily be grasped. It also makes the data more useful for further processing such as tabulation, analysis and interpretation. Thus, Classification is the first step in tabulation because the items with similarities must be brought together before the data are presented in the form of a table. 

Objectives of classification 

The main objectives of classification are:

        It makes the data more easily understood.

        It presents the facts in a simpler form.

        It condenses the huge mass of data by eliminating irrelevant details so that similarities and dissimilarities can easily be brought out.

        It facilitates quick comparison.

        It helps further statistical treatment such as the calculation of averages, dispersion etc.

        It makes errors easier to detect.

        It prepares the basis for tabulation.

Types of classification

         The collected data can be classified or grouped as follows:

        Geographical classification 

        Chronological classification

        Qualitative classification

        Quantitative classification

Geographical classification:

In this classification, the data are classified according to place, area, region etc. For example, the population density in different regions of Nepal:

| Region | Population Density (per sq. km.) |
|---|---|
| Eastern | 188 |
| Central | 293 |
| Western | 155 |
| Mid-western | 71 |
| Far-western | 112 |

 

Chronological classification

In this type of classification, the data are classified according to time. Time series data are good examples of chronological classification. For example, the population growth rate:

| Year | Growth Rate (in percentage) |
|---|---|
| 1961 | 2.10 |
| 1971 | 2.07 |
| 1981 | 2.24 |
| 1991 | 2.66 |
| 2001 | 1.65 |

Qualitative classification

Classification is said to be qualitative when the data are classified on the basis of some attribute or quality or descriptive characteristics, which are not capable of being described numerically. These types of data are known as categorical data or qualitative data. For example sex, nationality, honesty, color of eye, religions etc. 

A classification based on one attribute, which divides the data into two sub-divisions, is known as simple or 'two-fold' classification. If more than one attribute is to be studied simultaneously, the data are divided into a number of classes, and this is known as 'manifold' classification.

[Figure: Simple Classification]

[Figure: Manifold Classification]
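The difference between simple and manifold classification can be sketched with a small counting example. The records and attribute values below are hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical records of (sex, marital status), invented for illustration.
records = [
    ("male", "married"), ("female", "single"), ("female", "married"),
    ("male", "single"), ("male", "married"), ("female", "married"),
]

# Simple (two-fold) classification: one attribute, two classes.
by_sex = Counter(sex for sex, _ in records)

# Manifold classification: two attributes studied simultaneously.
by_sex_and_status = Counter(records)

print(dict(by_sex))
for (sex, status), n in sorted(by_sex_and_status.items()):
    print(f"{sex:<7} {status:<8} {n}")
```

Each extra attribute multiplies the number of classes: two attributes with two values each give four classes here.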

 

 

Quantitative classification 

Classification is said to be quantitative when the data are expressed numerically. In other words, classification of data according to some characteristic which can be measured quantitatively is known as quantitative classification. These types of data are known as numerical data or quantitative data. Height, weight, age, profit, income, pulse rate, sales, turnover, blood pressure etc. are some examples of this type of data. For example:

| Height in feet | No. of persons |
|---|---|
| 4 | 5 |
| 4.8 | 10 |
| 5 | 7 |
| 5.5 | 18 |
| 5.10 | 6 |
| 6 | 3 |

 

Stem-and-Leaf Display

Stem and leaf display is another graphical device that is useful for representing quantitative data sets. It became more commonly used in the 1980s, after the publication of John Tukey’s book Exploratory Data Analysis (EDA) in 1977.

Stem-and-leaf display (or stem-and-leaf plot) is another form of presentation of data and a valuable statistical tool for organizing a set of statistical data. The observations must first be arranged in ascending order (i.e. as an ordered array). The basic idea of this display is based on the analogy of plants: to construct a stem-and-leaf display, we partition each measurement into two parts. The first part is called the stem and the second part is called the leaf. The greatest common place value of the data is used to form the stems and the next greatest place value is used to form the leaves. For example, the value 54 has stem 5 (tens) and leaf 4 (units).

Stem-and-leaf display bears a strong resemblance to a histogram. It shows how the values are distributed and where the concentration of the data occurs over the range of observations in the data set.

Thus, stem-and-leaf display helps to give an idea of the shape of the distribution of the given data values.
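The construction described above can be sketched as a short Python function. This is an illustrative sketch, not a standard library routine: the function name stem_and_leaf, its leaf_unit parameter and the sample marks are all assumptions made for this example. It assumes non-negative values already rounded to the leaf unit.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: [leaves]} for non-negative values.

    leaf_unit is the place value of a leaf (1 for whole numbers,
    0.1 for data recorded to one decimal place).
    """
    rows = defaultdict(list)
    for v in sorted(values):               # ordered array first
        scaled = round(v / leaf_unit)      # express the value in leaf units
        stem, leaf = divmod(scaled, 10)    # last digit is the leaf
        rows[stem].append(leaf)
    # Include empty stems so that gaps in the data remain visible.
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}

# Hypothetical marks of ten students, invented for illustration.
marks = [47, 52, 38, 61, 55, 49, 58, 63, 52, 41]
for stem, leaves in stem_and_leaf(marks).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")

# Output:
# 3 | 8
# 4 | 1 7 9
# 5 | 2 2 5 8
# 6 | 1 3
```

Turned on its side, the rows of leaves look like the bars of a histogram with class width 10, which is why the display also reveals the shape of the distribution.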

Example 1

Following are the weights (in kg) of 40 students. Construct the stem and leaf display from the data given below.

85, 50, 48, 45, 54, 52, 54, 77, 63, 65,

85, 86, 87, 66, 67, 81, 82, 83, 66, 68,

67, 68, 69, 55, 56, 91, 92, 66, 68, 69,

84, 45, 47, 55, 77, 55, 54, 71, 72, 73

Solution

Arranging the given data in ascending order of magnitude (i.e. as an ordered array):

45, 45, 47, 48, 50, 52, 54, 54, 54, 55,

55, 55, 56, 63, 65, 66, 66, 66, 67, 67,

68, 68, 68, 69, 69, 71, 72, 73, 77, 77,

81, 82, 83, 84, 85, 85, 86, 87, 91, 92

The stem and leaf display is as follows.

| Stem | Leaves |
|---|---|
| 4 | 5 5 7 8 |
| 5 | 0 2 4 4 4 5 5 5 6 |
| 6 | 3 5 6 6 6 7 7 8 8 8 9 9 |
| 7 | 1 2 3 7 7 |
| 8 | 1 2 3 4 5 5 6 7 |
| 9 | 1 2 |

Example 2

The following data represent the annual expenses (in thousands of rupees) on food items in a city.

13.8, 14.1, 14.7, 15.2, 12.8, 15.6, 14.9, 16.7, 19.2, 14.9, 14.9, 14.9, 15.2, 15.9, 15.2, 14.8,

19.1, 14.6, 18.0, 14.9, 14.2, 14.1, 15.3, 15.3, 15.5, 18.0, 17.2, 17.2, 14.1, 14.5, 18.0, 14.4, 14.2, 14.6, 14.2, 14.8.

a.  Place the data into an ordered array.

b.  Construct the stem and leaf display.

c.  Which of these two displays seems to provide more information? Discuss.

Solution: (a) Arranging the given data in an ordered array (i.e. in ascending order):

12.8, 13.8, 14.1, 14.1, 14.1, 14.2, 14.2, 14.2, 14.4, 14.5, 14.6, 14.6, 14.7, 14.8, 14.8, 14.9, 14.9, 14.9, 14.9, 14.9, 15.2, 15.2, 15.2, 15.3, 15.3, 15.5, 15.6, 15.9, 16.7, 17.2, 17.2, 18.0, 18.0, 18.0, 19.1, 19.2

b)  The stem and leaf display is as follows.

| Stem | Leaves (0.1) |
|---|---|
| 12 | 8 |
| 13 | 8 |
| 14 | 1 1 1 2 2 2 4 5 6 6 7 8 8 9 9 9 9 9 |
| 15 | 2 2 2 3 3 5 6 9 |
| 16 | 7 |
| 17 | 2 2 |
| 18 | 0 0 0 |
| 19 | 1 2 |

 

(c) An ordered array shows only the values arranged from the smallest to the largest. The stem-and-leaf display provides two kinds of information:

(i) How the values are distributed, i.e. lowest annual expenses = Rs. 12.8 (in thousands) and largest annual expenses = Rs. 19.2 (in thousands).

(ii) Where the data are concentrated and the overall shape of the distribution: the expenses are concentrated around Rs. 14 thousand, and the most frequent value is Rs. 14.9 (in thousands), which occurs 5 times.

Therefore, the stem-and-leaf display provides more information than the ordered array.

 Example 3:

Construct stem and leaf display from the following data

5.35, 4.75, 4.30, 5.47, 4.85, 6.62, 3.54, 4.87, 6.26, 5.48, 7.27, 8.45, 6.05, 4.76, 5.91

Solution: Arranging the given data in ascending order:

3.54, 4.30, 4.75, 4.76, 4.85, 4.87, 5.35, 5.47, 5.48, 5.91, 6.05, 6.26, 6.62, 7.27, 8.45

Rounding each value to one decimal place, the data become:

3.5, 4.3, 4.8, 4.8, 4.9, 4.9, 5.4, 5.5, 5.5, 5.9, 6.1, 6.3, 6.6, 7.3, 8.5

Stem and leaf display

Stem | Leaves (leaf unit = 0.1)
3    | 5
4    | 3 8 8 9 9
5    | 4 5 5 9
6    | 1 3 6
7    | 3
8    | 5

   Example  4

From the following stem-and-leaf display regarding the age (in years) of Glaucoma patients treated in Tilganga Eye Institute, Kathmandu, construct a histogram using a class width of 10.

Stem | Leaves
4    | 0 2 3 4 6
5    | 0 1 2 5 5 7
6    | 0 1 2 3 5 5 5 7 8 9 9
7    | 1 2 4 5 8
8    | 0 4 7

a.  What percentage of cases is below 60?

b.  In which age group does the concentration of values lie?

c.  Construct also the relative percentage frequency distribution.

d.  What percentage of patients lies above 70?

Solution:

a) Total number of cases = 33

Number of cases below 60 = 11

Percentage of cases below 60 = (11/33) × 100 = 33.333 %

b) The concentration of values lies in the age group 60 to 69.

c) Relative percentage frequency = (f/N) × 100

Construction of relative percentage frequency distribution

Age (in years)   No. of Glaucoma patients (f)   Relative percentage frequency
40-50            5                              15.15
50-60            6                              18.18
60-70            11                             33.33
70-80            5                              15.15
80-90            3                              9.09
90-100           3                              9.09
Total            N = 33                         ≈ 100

d) Number of patients above 70 = 11

Percentage of patients above 70 = (11/33) × 100 = 33.333 %

 Variable or variate (Quantitative variable)

Any quantitative characteristic under study is known as quantitative variable (or variate). Generally, variables are denoted by X or Y or Z. The values taken by the variable are known as variate values. Basically there are two types of variables.

1.           Discrete variable

2.           Continuous variable

 

Discrete variable: A variable is said to be discrete if it takes only countably many values (typically whole numbers). For example: number of goals in a football match, number of children in a family, WBC count, number of beds in a hospital ward, etc.

Continuous variable: A variable is said to be continuous if it can take any real value (whole as well as fractional) within a certain range or limits. For example: height, weight, age, income, pulse rate, blood pressure, level of hemoglobin in blood, cholesterol, temperature, marks obtained by students, etc.

Raw Data

Raw data, also known as primary data, are data collected directly from a source. Raw data are arranged in neither ascending nor descending order of magnitude. Data arranged in ascending or descending order are known as arranged data or an “ordered array”.

Frequency

The number of times a variate value occurs is known as its frequency. It is generally denoted by f.

Tally Bars (or Tally Marks)

The vertical lines (| |) used to record the number of repetitions of a variable value are known as tally bars or tally marks. They make it easier to count a large mass of repeated data.

 

Frequency Distribution

A table consisting of the values of a variable along with their corresponding frequencies is known as a frequency distribution. It is a summary table in which the values of the variable are listed with their corresponding frequencies. Example:

Variable    15   20   10   25
Frequency   2    7    5    3

          

Types of frequency distribution

The distribution having only one variable is called univariate frequency distribution. A univariate frequency distribution further can be classified into three categories:

1.  Individual series 

2.  Discrete series or Discrete frequency distribution.

3.  Continuous series or continuous frequency distribution.

 

   Individual series

The series in which items (i.e. values of variable) or data are listed after observations in either organized or unorganized form is called individual series.

Examples:  (i) Marks obtained by a student in different subjects

Subject   Maths   Statistics   Account   Computer   Economics   Finance
Marks     55      61           55        58         62          50

(ii) The marks obtained by 20 students in a subject are given below:

23, 25, 69, 15, 70, 80, 50, 30, 25, 23, 50, 90, 70, 70, 80, 25, 25, 85, 80, 23

 Discrete series or Discrete frequency distribution

The series formed from a discrete variable along with their corresponding frequencies is known as discrete series or discrete frequency distribution. For example:

No. of goals scored   0   1   2   3
No. of matches        5   3   8   4

Continuous series or Continuous frequency distribution or Grouped frequency distribution

The series formed from a continuous variable with their corresponding frequencies is known as continuous series. It is also called continuous frequency distribution or grouped frequency distribution. In this frequency distribution, the variable can take all possible real values (whole number as well as fractional values) within a certain range or limits. The class intervals may or may not be of equal width.

Income (in ‘000’ Rs.)

30-40

40-50

50-60

60-70

70-80

80-90 

No. of persons

8

13

15

4

6

 

Ungrouped Frequency Distribution

A table consisting of variable values and their corresponding frequencies is known as ungrouped frequency distribution. In this distribution, data are not classified (or grouped) into different classes or groups. For example:

Marks 

20

30

40

45 

No. of students

2

7

5

 

 Grouped Frequency Distribution

Classification of data into different classes or groups according to their characteristics is known as grouped data. Therefore, a table consisting of grouped or classified data and their corresponding frequencies is known as grouped frequency distribution. 

Example:

Income (in 000 Rs.)

10-20

20-30

30-40

40-50 

No. of employees (f)

5

8

6

            Class size: The difference between the upper limit and the lower limit of a class interval is called the class size, class width or magnitude of the class interval.

            Class frequency: The frequency of each class interval is called the class frequency.

            Mid-point (mid-value) of class interval: It is defined as the average of the lower limit and the upper limit of a class interval. For example, referring to the above example:

Mid-point of the first class = (10 + 20)/2 = 15.

         Similarly, the others can be calculated.

General Guidelines for Classification of Data

There is no hard and fast rule for classifying data. However, the following points should be kept in mind.

            Classes should be clearly defined.

            The width of each class interval in the distribution should be equal as far as possible.

            Open-ended classes should be avoided as far as possible.

            The number of classes should be neither too large nor too small (generally 5 to 15). According to H.A. Sturges, the approximate number of classes (k) and the class width (h) are given by

(i) Approximate no. of classes (k) = 1 + 3.322 log n   (logarithm to base 10)

(ii) Width of class (h) = (L – S)/k

      Where         n = number of observations

       L = largest value

       S = smallest value

            Classes should be non-overlapping.

            Class frequencies should be obtained by using tally bars (tally marks).
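The two formulas of the Sturges guideline can be checked with a few lines of Python. This is a sketch only: the function names `sturges_classes` and `class_width` are assumptions introduced here, and the values n = 50, L = 205 and S = 70 are used purely for illustration.

```python
import math

def sturges_classes(n):
    """Approximate number of classes: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, k):
    """Class width: h = (L - S) / k."""
    return (largest - smallest) / k

k = sturges_classes(50)        # about 6.64
h = class_width(205, 70, k)    # about 20.3, usually rounded to a convenient width
print(round(k, 2), round(h, 1))
```

In practice k is rounded to a whole number and h to a convenient value, so the result is a guide rather than a fixed rule.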

 

Example 1

If class size (h) is given:

The following are the marks obtained by 60 students in Mathematics. Construct a grouped frequency distribution starting from 10 so that the width of each class interval is also 10.

52   44   38   59   51   71   20   44   30   47
40   63   39   59   52   72   23   45   39   60
54   30   47   40   14   73   25   45   39   61
52   73   64   55   30   47   28   46   39   61
52   74   28   46   40   65   55   33   29   62
54   28   46   40   62   54   47   40   65   55

Solution:

In these data the smallest mark is 14 and the largest is 74. The grouped frequency distribution of the marks obtained by the 60 students is as follows.

Marks    Tally bars             Number of students (f)
10-20    |                      1
20-30    |||| ||                7
30-40    |||| ||||              9
40-50    |||| |||| |||| |       16
50-60    |||| |||| |||          13
60-70    |||| ||||              9
70-80    ||||                   5

N = Σf = 60
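The tallying in this example can be reproduced with a short Python sketch. The function name `grouped_frequency` and its parameters are assumptions made for illustration; classes are treated as exclusive, so a mark equal to an upper limit is counted in the next class.

```python
def grouped_frequency(values, start, width, n_classes):
    """Count values falling in exclusive classes [lower, upper)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)   # position of the class containing v
        if 0 <= index < n_classes:
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

marks = [52, 44, 38, 59, 51, 71, 20, 44, 30, 47,
         40, 63, 39, 59, 52, 72, 23, 45, 39, 60,
         54, 30, 47, 40, 14, 73, 25, 45, 39, 61,
         52, 73, 64, 55, 30, 47, 28, 46, 39, 61,
         52, 74, 28, 46, 40, 65, 55, 33, 29, 62,
         54, 28, 46, 40, 62, 54, 47, 40, 65, 55]

for lower, upper, f in grouped_frequency(marks, start=10, width=10, n_classes=7):
    print(f"{lower}-{upper}: {f}")
```

The printed frequencies are 1, 7, 9, 16, 13, 9 and 5, which add up to 60.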

 

Example 2   

If class size (h) is not given:

The weights in grams of 50 apples, picked from a box are as follows:

110    103    89        75      98       121      110    108      93     128    185     123      113     92      86       70     126     78      13        120      29      119      105    120    100    116     85       99     114    189    205      87     111    141     136      123      90      115      128    160     78      90     107     81      137     75      84     104   109    115

Construct frequency distribution table with suitable number of classes. 

Solution: Since the class size (h) is not given, we first find the approximate number of class intervals (k) and the class size (h).

Number of observations, n = 50, S = smallest value = 70, L = largest value = 205

By Sturges' formula,

Number of classes, K = 1 + 3.322 log n = 1 + 3.322 log 50 = 1 + 3.322 × 1.6989 = 6.64, and range R = L – S = 205 – 70 = 135.

Class width or class size, h = R/K = 135/6.64 = 20.33 ≈ 20

Weights (in gms.)   Tally bars               No. of apples (f)
70-90               |||| |||| |              11
90-110              |||| |||| |||            13
110-130             |||| |||| |||| |||       18
130-150             ||||                     4
150-170             |                        1
170-190             ||                       2
190-210             |                        1

N = ∑ f = 50

 

Example 3

 If class size (h) is not given:

The birth weights (kg) of 30 children were recorded as follows:

2.0   2.1   2.3   3.0   3.1   2.7   2.8   3.5   3.1   3.7
4.0   2.3   3.5   4.2   3.7   3.2   2.7   2.5   2.7   3.8
3.1   3.0   2.6   2.8   2.9   3.5   4.1   3.9   2.8   2.2

Prepare a grouped frequency distribution for above data.

Solution: Since the class size (h) is not given, it is determined as follows.

Total no. of observations (n) = 30

Smallest value or observation (S) = 2.0

Largest value or observation (L) = 4.2

By Sturges' formula,

Number of classes, K = 1 + 3.322 log n = 1 + 3.322 log 30

= 1 + 3.322 × 1.477121 = 5.906996 ≈ 6

Class size or class width (h) = (L – S)/K = (4.2 – 2.0)/6 = 0.37 ≈ 0.4

Now, the frequency distribution is shown below:

Birth weight   Tally bar    Frequency (f)
2.0 – 2.4      ||||         5
2.4 – 2.8      ||||         5
2.8 – 3.2      |||| ||||    9
3.2 – 3.6      ||||         4
3.6 – 4.0      ||||         4
4.0 – 4.4      |||          3

 

Note: Sturges' rule should not be regarded as final but should be considered as a guide only.

The formula is: K = 1 + 3.322 log n

where K is the number of classes and n is the number of values or observations in the data set under consideration.

The size of the class intervals may be decided by using the following formula:

h = R/K

where R = Range = Largest item – Smallest item under consideration, and K = the number of class intervals.

 

Types of Class Intervals

Exclusive type of class interval: The class intervals in which the upper limit of preceding class is the lower limit of succeeding class so as to make continuous without any gap are known as exclusive class intervals. In exclusive class intervals, the upper limit of each class interval is excluded in that class but included in the next class. 

For example:

Marks             60-70   70-80   80-90   90-100
No. of students   10      15      5       2

Inclusive type of class interval: If the upper limit as well as the lower limit of each class interval is included in that class interval, then class intervals are known as inclusive type of class interval. In this type of class intervals the upper limit of preceding class is not the lower limit of succeeding class.

The width of this type of class interval = Lower limit of next C.I. – Lower limit of previous C.I. For example:

Marks

60-69

70-79

80-89

90-99 

No. of students

21

10

6

Conversion of Inclusive Class Interval to Exclusive Class Interval

In statistical computation, exclusive class intervals are generally preferred. To convert inclusive class intervals to exclusive class intervals, the correction factor is defined as below.

Correction Factor, C.F. = (Lower limit of a class – Upper limit of the preceding class)/2

Now, real lower limit = Lower limit – C.F. and real upper limit = Upper limit + C.F.
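The conversion can be sketched in Python as follows. The sketch assumes the usual definition of the correction factor, namely half the gap between the lower limit of a class and the upper limit of the preceding class; the function name `to_exclusive` is hypothetical.

```python
def to_exclusive(classes):
    """Convert inclusive (lower, upper) class limits into exclusive class boundaries."""
    # Correction factor: half the gap between two successive classes
    cf = (classes[1][0] - classes[0][1]) / 2
    return [(lower - cf, upper + cf) for lower, upper in classes]

print(to_exclusive([(50, 59), (60, 69), (70, 79)]))
# [(49.5, 59.5), (59.5, 69.5), (69.5, 79.5)]
```

After the conversion the upper boundary of each class coincides with the lower boundary of the next, so the classes become continuous.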

Example 4

Convert the inclusive frequency distribution into exclusive frequency distribution from the data given below.

Marks             50-59   60-69   70-79   80-89   90-99
No. of students   10      15      30      20      12

Solution:

Here,

Correction factor, C.F. = (Lower limit of second class – Upper limit of first class)/2

= (60 – 59)/2 = 1/2 = 0.5

Real lower limit = Lower limit – C.F. = Lower limit – 0.5

Real upper limit = Upper limit + C.F. = Upper limit + 0.5

Then, the exclusive frequency distribution is given below.

Marks             49.5-59.5   59.5-69.5   69.5-79.5   79.5-89.5   89.5-99.5
No. of students   10          15          30          20          12

Example 5

 Convert the following data into exclusive class:

Interval

10-14

15-19

20-24 

25-29 

30-34 

Frequency

11

15

Solution: 

Correction factor, C.F. = (15 – 14)/2 = 0.5

Then, the exclusive frequency distribution is given below.

 

 

Interval

9.5-14.5

14.5-19.5

19.5-24.5 

24.5-29.5 

29.5-34.5 

Frequency

11

15

 

Open-ended classes: If the lower limit of the first class, the upper limit of the last class, or both are not specified, such a class is called an open-ended class.

For example:

(a)
Marks             Below 50   50-60   60-70   70-80
No. of students   10         20      12

(b)
Marks             Below 60   60-70   70-80   Above 80
No. of students   10         20      12

(c)
Marks             50-60   60-70   70-80   Above 80
No. of students   10      20      12

Cumulative Frequency Distribution

After arranging the classes (Variate values) in ascending order of magnitude, the frequency distribution in which the frequencies are cumulated either from top to bottom or bottom to top is called cumulative frequency distribution. There are two types of cumulative frequency distribution. 

(a) Less than cumulative frequency distribution (b) More than cumulative frequency distribution.

Less than cumulative frequency distribution

After arranging the classes (Variate values) in ascending order of magnitude, the frequency distribution in which the frequencies are cumulated from top to bottom is called less than cumulative frequency distribution.

Example

Marks           No. of students (Less than c.f.)
Less than 10    5
Less than 20    12
Less than 30    20
Less than 40    22
Less than 50    30

OR

Marks (Less than)   10   20   30   40   50
No. of students     5    12   20   22   30

 

Conversion of the above less than cumulative frequency distribution into a simple frequency distribution

Marks     No. of students (f)
0-10      5
10-20     12 – 5 = 7
20-30     20 – 12 = 8
30-40     22 – 20 = 2
40-50     30 – 22 = 8

 

More than cumulative frequency distribution

After arranging the classes (variate values) in ascending order of magnitude, the frequency distribution in which the frequencies are cumulated from bottom to top is called more than cumulative frequency distribution. Example:

Income (in ‘000’ Rs.)   No. of persons (More than c.f.)
Above 10                35
Above 20                27
Above 30                20
Above 40                8
Above 50                3

OR

Income (in ‘000’ Rs.) (More than)   10   20   30   40   50
No. of persons                      35   27   20   8    3

 

Conversion of the above more than cumulative frequency distribution into a simple frequency distribution

Income (in ‘000’ Rs.)   No. of persons (f)
10-20                   35 – 27 = 8
20-30                   27 – 20 = 7
30-40                   20 – 8 = 12
40-50                   8 – 3 = 5
50 and above            3
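Both kinds of cumulation, and the conversion back to simple frequencies, can be sketched in Python. The function names below are assumptions introduced for illustration; the frequencies 5, 7, 8, 2, 8 are those of the marks distribution obtained from the less than table.

```python
from itertools import accumulate

def less_than_cf(freqs):
    """Cumulate frequencies from top to bottom."""
    return list(accumulate(freqs))

def more_than_cf(freqs):
    """Cumulate frequencies from bottom to top."""
    return list(accumulate(freqs[::-1]))[::-1]

def cf_to_simple(less_than):
    """Recover simple frequencies by differencing successive less than c.f. values."""
    return [less_than[0]] + [b - a for a, b in zip(less_than, less_than[1:])]

freqs = [5, 7, 8, 2, 8]
print(less_than_cf(freqs))                 # [5, 12, 20, 22, 30]
print(more_than_cf(freqs))                 # [30, 25, 18, 10, 8]
print(cf_to_simple([5, 12, 20, 22, 30]))   # [5, 7, 8, 2, 8]
```

A more than cumulative distribution is converted in the same way, except that each frequency is the difference between a value and the one below it.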

 

 

 

 

Relative Frequency

Relative frequency is the proportion (ratio) of the observations belonging to a class to the total number of observations. That is,

Relative frequency of class = Frequency of class / Total number of observations = f/N

If the relative frequency of a class is multiplied by 100, it is called the percent frequency of the class.

Relative percentage frequency of class = (f/N) × 100
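As a quick check of these two definitions, the following Python sketch computes relative and percentage frequencies. The function name `relative_frequencies` is an assumption, and the class frequencies 10, 15, 37, 12, 6 are used only as illustrative input.

```python
def relative_frequencies(freqs):
    """Relative frequency of each class: f / N."""
    total = sum(freqs)
    return [f / total for f in freqs]

freqs = [10, 15, 37, 12, 6]                 # illustrative class frequencies, N = 80
relative = relative_frequencies(freqs)
percent = [100 * r for r in relative]       # percentage frequency = (f / N) * 100

for f, r, p in zip(freqs, relative, percent):
    print(f, round(r, 4), round(p, 2))
```

The relative frequencies always add up to 1 and the percentage frequencies to 100, apart from rounding.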

 

Example 6

Prepare a simple frequency distribution, relative frequency distribution, percentage frequency distribution and more than cumulative frequency distribution from the following data:

Weight (kg)       Below 10   Below 20   Below 30   Below 40   Below 50
No. of children   5          9          12         15         19

OR

Weight (in kg) (Less than)   10   20   30   40   50
No. of children              5    9    12   15   19

 

Solution: 

Conversion of the above less than cumulative frequency distribution into a simple frequency distribution

Simple frequency distribution

Weight (in kg)   No. of children (f)
0-10             5
10-20            9 – 5 = 4
20-30            12 – 9 = 3
30-40            15 – 12 = 3
40-50            19 – 15 = 4
Total            N = ∑ f = 19

Here, relative frequency of class = f/N and percentage frequency of class = (f/N) × 100.

 

Relative frequency distribution and percentage frequency distribution

Weight (in kg)   No. of children (f)   Relative frequency   Percentage frequency
0-10             5                     5/19 = 0.26          26
10-20            4                     4/19 = 0.21          21
20-30            3                     3/19 = 0.16          16
30-40            3                     3/19 = 0.16          16
40-50            4                     4/19 = 0.21          21
Total            N = ∑ f = 19          1.00                 100

 

More than cumulative frequency distribution

Weight (in kg)   More than c.f.
More than 0      14 + 5 = 19
More than 10     10 + 4 = 14
More than 20     7 + 3 = 10
More than 30     4 + 3 = 7
More than 40     4

 

Example 7

Calculate relative frequency and percent frequency for each class from the following frequency distribution.

Blood pressure (mmHg)   125-135   135-145   145-155   155-165   165-175   Total
No. of patients (f)     10        15        37        12        6         N = 80

Solution

We have,

Relative frequency = Frequency of class / Total observations

Percent frequency = Relative frequency × 100

Calculation of relative frequency and percent frequency:

Blood pressure (in mmHg)   No. of patients (f)   Relative frequency   Relative percentage frequency
125-135                    10                    10/80 = 0.125        12.5
135-145                    15                    15/80 = 0.1875       18.75
145-155                    37                    37/80 = 0.4625       46.25
155-165                    12                    12/80 = 0.15         15
165-175                    6                     6/80 = 0.075         7.5
Total                      N = 80                1                    100

 

Example 8

The following are the marks in Statistics (out of 50 full marks) obtained by 25 students of Pokhara University.

40, 25, 30, 22, 35, 18, 28, 38, 45, 34, 31, 15, 24, 19, 20, 26, 09, 18, 07, 29, 35, 43, 41, 40, 30

Construct a grouped frequency distribution with a suitable class size (or class width) and prepare the relative frequency distribution.

Solution:

Since the class size or class width (h) is not given, the number of classes (k) and the class size (h) are calculated first.

Number of observations, n = 25, S = smallest value = 7, L = largest value = 45

By Sturges' formula,

Number of classes, K = 1 + 3.322 log n = 1 + 3.322 log 25 = 1 + 3.322 × 1.39794 = 5.644 ≈ 6

Class size or class width, h = (L – S)/K = (45 – 7)/6 = 6.33 ≈ 7

Marks    Tally bars   No. of students (f)   Relative frequency = f/N
7-14     ||           2                     0.08
14-21    ||||         5                     0.20
21-28    ||||         4                     0.16
28-35    |||| |       6                     0.24
35-42    |||| |       6                     0.24
42-49    ||           2                     0.08
Total                 N = ∑ f = 25          1

Example  9

Find out class intervals from the following frequency distribution.

Mid-point   5   10   15   20   25   30
Frequency   1   3    5    10   2    1

Solution

Class size or width of class (h) = Difference between two successive mid-values

= 10 – 5

h = 5

Subtract h/2 = 2.5 from each mid-value to get the lower limit, and add h/2 = 2.5 to the same mid-value to get the upper limit.

Hence, the grouped frequency distribution of the given problem is as follows.

Class interval   2.5-7.5   7.5-12.5   12.5-17.5   17.5-22.5   22.5-27.5   27.5-32.5
Frequency        1         3          5           10          2           1
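The same steps can be written as a short Python sketch. The function name `intervals_from_midpoints` is hypothetical, and the sketch assumes equally spaced mid-points, as in this example.

```python
def intervals_from_midpoints(midpoints):
    """Rebuild class limits from equally spaced mid-points."""
    h = midpoints[1] - midpoints[0]            # class width
    return [(m - h / 2, m + h / 2) for m in midpoints]

print(intervals_from_midpoints([5, 10, 15, 20, 25, 30]))
# [(2.5, 7.5), (7.5, 12.5), (12.5, 17.5), (17.5, 22.5), (22.5, 27.5), (27.5, 32.5)]
```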

  

                                                                                                                                 

  Diagrammatic presentation of data

After classification and tabulation of numerical data for statistical enquiry, the next method of presenting the numerical data is the use of diagrams. Presentation of statistical data through the use of diagram is known as diagrammatic presentation.

Diagrams are nothing but the presentation of statistical data in the form of geometrical figures like bars, rectangles, circles etc.

General rules for constructing of diagrams

There are no hard and fast rules for constructing diagrams. However, the following points have to be kept in mind while constructing diagrams. 

Title: The title should be brief, self-explanatory and clear. It may be written either at the top or at the bottom of the diagram. Generally, it is kept at the top.

Choice of a diagram: Proper diagram should be chosen depending upon the nature of data.

Simplicity: As much as possible, the diagram should be simple and easy to understand.

Proportion between dimensions: A proper proportion between the dimensions (i.e. height and width) of the diagram should be maintained.

Neatness and cleanliness: As diagrams are visual aids to present the numerical data, the diagrams should be neat and clean.

Selection of scale: The most important thing is the selection of a proper scale. The scale (i.e. horizontal and vertical scales) should clearly be shown on the diagram.

Footnote: If necessary, footnotes should be given at the bottom of the diagram to explain facts or points which are not included in the title.

Index: A brief index showing various types of shades, colours, lines should be used for clear understanding of the diagram.

The different methods of presenting data in diagrams are as follows:

        Bar diagrams (One dimensional diagram)

        Pie-diagrams or pie-charts (Two dimensional diagram)

Bar Diagrams (One dimensional diagram)

Bar diagrams are among the simplest and most commonly used diagrammatic methods of presenting scientific, health, medical, engineering, business, economic and research data. A bar diagram consists of a set of rectangles, one for each item of data, in which the magnitudes or values are represented by the length (or height) of the bars. Bar diagrams are one dimensional diagrams since only one dimension (i.e. height or length) of the bars is taken into consideration. They are most suitable for nominal or ordinal categories.

Before drawing bar diagrams, the following points are to be kept in mind:

        All the bars must be drawn on the same base line.

        The breadths of the bars must be same.

        The height (or length) of the bars must be proportional to the magnitude (or values).

        Bars can be drawn vertically or horizontally depending upon the convenience.

        The gap (distance) between two successive bars should be the same.

Types of bar diagrams

        Simple bar diagram

        Sub-divided or component bar diagram

        Percentage sub-divided bar diagram 

        Multiple bar diagram

Simple bar diagram

Simple bar diagram is a diagram for only one variable. It is the simplest among other bar diagrams and frequently used for the comparative study of two or more values of a single variable. It consists of a set of equidistant bars (or rectangles) of equal width. The heights or lengths of the bars are represented by the given values of the variable.

  Example 1

The following information provides the number of tourists by major nationality in 1995 in Nepal. Present the data in the form of a suitable diagram.

Nationality                  India   UK    USA   Australia   Others
No. of tourists (millions)   2.9     1.2   2.5   1.8         1.5

Solution:

[Figure: simple bar diagram of the number of tourists by nationality]

Example 2

 

Prepare the simple bar diagram from the following data:

Year                             :   2002       2003      2004     2005    2006

No. of Patients’ record:    400         150          300      200      500

 

 

Sub-divided or component bar diagram

A simple bar diagram can represent only one characteristic or classification at a time. When two or more characteristics have to be presented at a time, a sub-divided or component bar diagram is suitable. A sub-divided or component bar diagram shows two or more components of a total. It is therefore particularly useful when the total magnitude of the variable under consideration is to be divided into different parts or segments.

Example 1

Represent the following information regarding work participation rate by ecological region of Nepal through a sub-divided bar diagram.

Region      Years
            1971   1981   1991
Mountains   70.0   75.7   74.5
Hills       62.9   68.8   62.8
Terai       51.3   58.8   48.6

Solution

[Figure: sub-divided bar diagram by year; index: Mountains, Hills, Terai]

  Example 2

Represent the following data by means of a sub-divided bar diagram.

Year    Patients
        Male   Female   Children
2002    100    200      85
2003    200    180      150

Solution:

Year    Patients
        Male   Female   Children   Total
2002    100    200      85         385
2003    200    180      150        530

[Figure: sub-divided bar diagram by year; index: Male, Female, Children]

Example 3

Represent the following information by sub-divided bar diagram.

Items               Expenditure of Family A     Expenditure of Family B
                    (Income Rs. 5000)           (Income Rs. 3000)
Food                1500                        1500
Clothing            1250                        600
Education           250                         500
Miscellaneous       1900                        700
Saving or deficit   +100                        -300

[Figure: sub-divided bar diagram with index]

Percentages Sub-divided Bar Diagram

A percentage sub-divided bar diagram is a sub-divided bar diagram in which the data are presented in terms of percentages. It is used for comparing relative changes in the data, for which it is more appropriate than a simple sub-divided bar diagram.

The height of each bar is the same, i.e. 100, since the sum of the percentage values is equal to 100.
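Converting the components of a bar into percentages is a one-line calculation, sketched here in Python. The function name `percentage_components` is an assumption, and the patient figures for 2002 (100 male, 200 female, 85 children) are reused from the sub-divided bar diagram example purely as illustrative input.

```python
def percentage_components(components):
    """Express each component as a percentage of the bar's total."""
    total = sum(components.values())
    return {name: 100 * value / total for name, value in components.items()}

patients_2002 = {"Male": 100, "Female": 200, "Children": 85}
for name, pct in percentage_components(patients_2002).items():
    print(f"{name}: {pct:.1f}%")
```

Because every bar is rescaled to a total of 100, bars with different totals can be compared on the same footing.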

   Example 1

The percentage breakup of the cost of constructing a house in Kathmandu (land cost excluded) is given below:

Labour = 25%      Bricks=15%    Cement = 20%

Steel = 15%           Timber=10%    Supervision=15%

Construct the percentage sub-divided bar diagram

Solution:

 

Multiple Bar Diagram

A multiple bar diagram is also a one dimensional bar diagram. It is used to compare two or more interrelated sets (series) of data. In a multiple bar diagram, several characteristics of the same set of data are presented and compared. A constant distance should be maintained between the groups of bars of the interrelated sets of data. Different colours or shades are used to differentiate the bars.

For example, the records of students passing in the first, second and third division in a certain examination for different years of a certain school can be represented by a multiple bar diagram.

[Figure: multiple bar diagram showing passed students, based on hypothetical data]

   Example 1

Represent the following data of health personnel per 100,000 population and its growth between 1991/92 and 2010/11 by means of a multiple bar diagram using Excel.

Personnel           Number of personnel in
                    1991/92   2001/02   2010/11
Doctor              5         6         10
Nurse               10        15        25
Health Assistance   7         10        20
Health Worker       2         30        90

   Example 2

          

Represent the following data with the help of a multiple bar diagram.

Country   No. of Tourists in Year 2010   No. of Tourists in Year 2015
USA       220                            300
UK        140                            200
France    80                             150
India     500                            350
China     300                            250
Others    150                            280

Solution:

[Figure: multiple bar diagram of the number of tourists; index: 2010, 2015]

  Example 3

Draw a multiple bar diagram from the following information:

Disease      Household percentage
             Terai   Hills   Mountains
Jaundice     5.5     4.0     6.0
Typhoid      7.0     5.5     8.0
Diarrhoea    12.9    14.1    20.5

 

 

Graphical Presentation of Data

Like diagrammatic presentation, graphical presentation is another method of presenting data so that the information can be easily understood. Even when data show complex relations among variables, graphs make them clear at a glance. Graphs make it easier to interpret and analyze the data. They are particularly helpful in depicting the median, mode, skewness, kurtosis, correlation, regression, time series, etc.

A graph should be neat, simple, clear and self-explanatory. It should also have a self-explanatory title. If necessary, footnotes, source notes and indices should be placed appropriately in the graph. The arrangement of scales is also an important part of constructing graphs.

Generally, graphs can be divided into the following two categories.

(a)   Graph of time series

(b)   Graph of frequency distribution or frequency graph

Distinction between diagrams and graphs

Diagrams

1.    They are constructed on plain paper.

2.    They may be one, two or three dimensional.

3.    The numerical data are presented by bars, rectangles, circles, cubes, etc.

4.    They are used only for comparison.

5.    Diagrams are easier to construct than graphs.

6.    Diagrams are more attractive and are understood by everyone.

7.    Diagrams are not used for presenting frequency distributions.

8.    They are rarely used by statisticians and research workers.

Graphs

1.    They are constructed on graph paper.

2.    They are generally two dimensional.

3.    The numerical data are presented by points and lines.

4.    They help in studying further mathematical and statistical relationships.

5.    Graphs are more difficult to construct than diagrams.

6.    Graphs are familiar mainly to statisticians and researchers.

7.    Graphs are more appropriate for presenting frequency distributions.

8.    They are frequently used.

Pareto Diagram

             A Pareto diagram contains bars and a line graph: the categorized responses are plotted in descending order of their frequencies by bars, and the cumulative frequencies by a line graph. The categories, in descending order of frequency, are taken along the horizontal axis, the frequencies along the left vertical axis and the percentage cumulative frequencies along the right vertical axis. Equally spaced bars of equal width are drawn on the horizontal axis with heights proportional to the corresponding frequencies; the percentage cumulative frequencies are then plotted against the mid-points of the bars and joined by lines.

        It is used in statistical control of process and product quality.

        It is also a useful tool for presenting categorical data, particularly when the number of classifications or groupings increases.

Example: For a computer-controlled lathe whose performance was below par, workers recorded the following causes and their frequencies:

Cause                    Frequency
Power fluctuations       6
Controller not stable    22
Operator error           13
Worn tool not replaced   5
Other                    2

        Solution

Arrange the data in descending order of frequency and obtain the cumulative frequencies and percentage cumulative frequencies as follows:

Categories                 Frequency   Cumulative frequency   Percentage cumulative frequency
Controller not stable      22          22                     46
Operator error             13          35                     73
Power fluctuation          6           41                     85
Worn tool not replaced     5           46                     96
Others                     2           48                     100
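The calculation for the Pareto diagram can be checked with a short Python sketch. The names `causes` and `pareto_table` are chosen here only for illustration, and the percentages are rounded to the nearest whole number; drawing the bars and the line is left out.

```python
# Frequencies recorded for the lathe example (taken from the table above).
causes = {
    "Power fluctuations": 6,
    "Controller not stable": 22,
    "Operator error": 13,
    "Worn tool not replaced": 5,
    "Other": 2,
}

def pareto_table(freqs):
    """Return rows (category, frequency, cumulative, % cumulative) in descending order."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    total = sum(freqs.values())
    rows, running = [], 0
    for name, f in ordered:
        running += f  # cumulative frequency so far
        rows.append((name, f, running, round(100 * running / total)))
    return rows

rows = pareto_table(causes)
for row in rows:
    print(row)
```

The bars of the Pareto diagram use the second column of each row, and the line graph uses the last column.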

 

 

 

Two-dimensional diagram

Bar diagrams are one-dimensional diagrams, since the magnitude of the given data is represented through only one dimension: the magnitudes are measured only by the height or length of the bars. In two-dimensional diagrams, both the length and the width are considered. Thus, the magnitudes of the given data are represented by the area of the diagram.

The different types of two-dimensional diagrams are rectangles, squares and circles. Here we consider only the circle (i.e. pie-diagram or pie-chart).

Pie-diagram (Pie-chart) 

Pie diagram is another popular and widely used method of diagrammatic presentation, in the form of a circle whose area represents the total value. It is a two-dimensional diagram. It is also known as a pie-chart, circular diagram or angular diagram. In this method all the given values are converted into angles so that the sum of all these angles equals 360 degrees.

The main drawback of the pie-diagram is that it cannot be used when negative values, such as the loss or deficit of several small sectors, are to be presented.

 The given values are converted into angles in the following way:

Total value = 100% = 360° (The angle at the centre of the circle = 360°)

Then, Central angle for any given value = (Given value / Total value) × 360°

Example 1

Express the following data using pie-chart.

Items            Expenditure as percent of total
Food             50
Clothing         15
Housing          10
Fuel             5
Education        10
Entertainment    5
Miscellaneous    5

Solution

         Since, the angle at the centre of the circle = 360°

         Here, total value = 360°

                  100% = 360°

         Then, x% = (360/100) × x°

                           = (3.6 × x) °

Calculation for pie-diagram

Items            % of exp.   Values in angle = (3.6 × x)
Food             50          3.6 × 50 = 180°
Clothing         15          3.6 × 15 = 54°
Housing          10          3.6 × 10 = 36°
Fuel             5           3.6 × 5 = 18°
Education        10          3.6 × 10 = 36°
Entertainment    5           3.6 × 5 = 18°
Miscellaneous    5           3.6 × 5 = 18°
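The conversion of values into central angles can be sketched in Python as follows. The function name `pie_angles` is only illustrative; it applies the rule angle = (value / total) × 360° to the expenditure percentages of this example.

```python
def pie_angles(values):
    """Convert each value into its central angle (in degrees) out of 360."""
    total = sum(values.values())
    return {item: 360 * value / total for item, value in values.items()}

# Expenditure as percent of total, from the example above.
expenditure = {
    "Food": 50,
    "Clothing": 15,
    "Housing": 10,
    "Fuel": 5,
    "Education": 10,
    "Entertainment": 5,
    "Miscellaneous": 5,
}

angles = pie_angles(expenditure)
for item, angle in angles.items():
    print(f"{item}: {angle:.0f} degrees")
```

Because the function divides by the total of the values, it works equally well for raw amounts and for percentages.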

Example 2

Construct pie diagrams for the following weekly family expenditures of two families F1 and F2.

Items            Family F1 (Income Rs. 400)    Family F2 (Income Rs. 900)
Food             200                           350
Education        50                            100
Fuel             50                            150
Entertainment    20                            90
Miscellaneous    80                            210

Solution

         For family F1:

         Taking Rs. 400 = 360°

                  Re. 1 = (360/400)°

         For food item, Rs. 200 = (360/400) × 200 = 180°

 In similar fashion, other angles are calculated.

         For family F2:

         Taking Rs. 900 = 360°

                  Re. 1 = (360/900)°

         For food item, Rs. 350 = (360/900) × 350 = 140°

         Similarly others can be calculated.

Calculation for pie-diagram

                 For family F1             For family F2
Items            Expenditure   Angle       Expenditure   Angle
Food             200           180°        350           140°
Education        50            45°         100           40°
Fuel             50            45°         150           60°
Entertainment    20            18°         90            36°
Miscellaneous    80            72°         210           84°
Total            400           360°        900           360°
Square root      20                        30

The radii of the two circles are taken in proportion to the square roots of the totals:

Radius:   r1 = 20/30 = 0.67 (relative to r2)     r2 = 30/20 = 1.50 (relative to r1)
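As a rough check of the table above, the following Python sketch computes the angles for both families and the square roots of the totals. It assumes the usual convention that the radii are taken proportional to the square roots of the totals, so that the areas of the circles are proportional to the totals; the variable names are illustrative only.

```python
import math

f1 = {"Food": 200, "Education": 50, "Fuel": 50, "Entertainment": 20, "Miscellaneous": 80}
f2 = {"Food": 350, "Education": 100, "Fuel": 150, "Entertainment": 90, "Miscellaneous": 210}

def pie_angles(values):
    """Central angle of each item: (value / total) x 360 degrees."""
    total = sum(values.values())
    return {item: 360 * value / total for item, value in values.items()}

angles_f1 = pie_angles(f1)
angles_f2 = pie_angles(f2)

# Square roots of the totals; the radii are taken in this ratio.
root_f1 = math.sqrt(sum(f1.values()))
root_f2 = math.sqrt(sum(f2.values()))

print(angles_f1)
print(angles_f2)
print(f"r1 : r2 = {root_f1:.0f} : {root_f2:.0f}")
```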

 

 

 

 

Graphical Presentation of Data

Like diagrammatic presentation, graphical presentation is another method of presenting data so that the information can be easily understood. Graphical presentation of data is the technique of presenting data visually in the form of a graph. Even when data show complex relations among variables, graphs make them clear at a glance. Graphs make it easier to interpret and analyze the data. They are especially helpful in depicting the median, mode, skewness, kurtosis, correlation, regression, time series, etc.

A graph should be neat, simple, clear and self-explanatory. It should also have a self-explanatory title. If necessary, footnotes, source notes and indices should be placed appropriately in the graph. The choice of scales is also an important part of constructing graphs.

The importance of graphical presentation of data is as follows:

        A graph presents the data in a way that is easier for the user to understand.

        It gives a more attractive presentation of statistical data than a table. By looking at the graph, it is easier to notice the main features of the data, such as trends and fluctuations, at a glance.

        It is a time-saving device.

        It facilitates comparison of data relating to two different periods of time or to two different regions.

        It does not require much knowledge of mathematics on the part of the user to understand the message of the graph.

 

Generally, graphs can be divided into the following two categories.

(a)    Graph of time series

(b)   Graph of frequency distribution or frequency graph

Graph of time series

This is a graphic presentation of chronological data (data varying with time). Generally the time variable is taken along the X-axis and the other variable under consideration along the Y-axis.

Graph of Frequency Distribution or Frequency Graphs

This is another category of graphs which shows the graphical relationship between the variable values and their corresponding frequencies of the data under consideration. The following are the main types of frequency graphs.

(a)   Histogram

(b)   Frequency polygon

(c)   Frequency curve

(d)   Cumulative frequency curve (or ogive)

Each of these is discussed below with suitable examples and illustrations.

Histogram

Histogram is the most common, popular and widely used graphic presentation of a frequency distribution. It consists of a set of adjacent vertical rectangles on the X-axis, with bases equal to the widths of the corresponding class intervals and heights proportional to the corresponding class frequencies, so that the area of each rectangle is proportional to the frequency of the corresponding class.

 A histogram is used to describe numerical data which have been grouped into frequency, relative frequency or percentage distributions. The variable under consideration is taken along the X-axis and the frequency, relative frequency or percentage frequency per class interval along the vertical Y-axis. A histogram constructed with the variable along the X-axis and the corresponding relative frequencies along the Y-axis is called a relative frequency histogram.

A histogram is useful for locating the mode, and it is also used to assess the normality of the distribution.

Histogram with equal width classes

If the class intervals in the frequency distribution under consideration are of equal width, then the vertical bars in the histogram are also of equal width. Such a histogram is called a histogram with equal width classes. The height of each rectangle is the frequency of the corresponding class interval.

Example 1

Construct a histogram from the following grouped frequency distribution.

Class interval:   5-10   10-15   15-20   20-25   25-30   30-35

Frequency:        2      5       7       12      6

Solution

[Figure: histogram; X-axis: Class interval]

   Example 2

 

Plot a histogram for the following frequency distribution and locate the mode.

Marks:             0-20   20-40   40-60   60-80   80-100

No. of students:   10     22      35      28

Solution: This is the case of equal class size (or equal class width).

[Figure: histogram showing the location of the mode; X-axis: Marks]

Mode (Mo) = 52 marks
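The graphical location of the mode can be cross-checked with the usual interpolation formula for grouped data, Mo = L + (f1 − f0) / (2f1 − f0 − f2) × h. The sketch below is only a check under that assumption; the function name `mode_grouped` is illustrative. With modal class 40-60 (f1 = 35), preceding frequency f0 = 22 and following frequency f2 = 28, it gives a value close to the one read from the graph.

```python
def mode_grouped(lower, width, f_prev, f_modal, f_next):
    """Mode of grouped data by interpolation within the modal class."""
    d1 = f_modal - f_prev   # excess over the preceding class
    d2 = f_modal - f_next   # excess over the following class
    return lower + d1 / (d1 + d2) * width

# Modal class 40-60 of the marks distribution above.
mode = mode_grouped(lower=40, width=20, f_prev=22, f_modal=35, f_next=28)
print(f"Mode = {mode:.1f} marks")
```

A value read from a hand-drawn histogram is approximate, so a small difference between the two results is to be expected.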

 

   Example 3

Following is the distribution of marks obtained by 50 students. Plot a histogram for these data.

Marks:             20-29   30-39   40-49   50-59   60-69   70-79

No. of students:   5       10      20      8       6       1

Solution:

Correction factor = (Lower limit of second class − Upper limit of first class) / 2 = (30 − 29) / 2 = 0.5

The given inclusive class intervals are converted into exclusive classes by subtracting the correction factor from each lower limit and adding it to each upper limit.

Marks:             19.5-29.5   29.5-39.5   39.5-49.5   49.5-59.5   59.5-69.5   69.5-79.5

No. of students:   5           10          20          8           6           1
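The conversion from inclusive to exclusive classes can be sketched in Python as below. The function name `to_exclusive` is illustrative; it assumes the classes are listed in increasing order with the same gap between every pair of successive classes.

```python
def to_exclusive(classes):
    """Convert inclusive class limits into exclusive class boundaries.

    classes: list of (lower, upper) limits in increasing order.
    Returns the correction factor and the list of adjusted classes.
    """
    # Half the gap between the first upper limit and the next lower limit.
    correction = (classes[1][0] - classes[0][1]) / 2
    adjusted = [(lower - correction, upper + correction) for lower, upper in classes]
    return correction, adjusted

marks = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
correction, boundaries = to_exclusive(marks)
print(correction)
print(boundaries)
```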

Then the histogram is plotted as follows:

[Figure: histogram; X-axis: Marks]

   Example 4

 Draw a histogram from the following distribution:

Mid-point:   5   10   15   20   25   30

Frequency:  1     3    5    10    5     3

Solution:

Since mid values are given, the class intervals are constructed first.

Class size (h) = Difference between two successive mid values

                   = 10 − 5 = 5

Subtract h/2 = 2.5 from the first mid value to get the lower limit of the first class interval, and add h/2 = 2.5 to the same mid value to get its upper limit. The other class intervals are constructed in a similar fashion, as shown in the calculation table below:

Class interval:   2.5-7.5   7.5-12.5   12.5-17.5   17.5-22.5   22.5-27.5   27.5-32.5

Frequency:        1         3          5           10          5           3
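The construction of class intervals from mid values can be sketched as follows. The function name `intervals_from_midpoints` is illustrative, and the sketch assumes equally spaced mid values.

```python
def intervals_from_midpoints(mids):
    """Build class intervals (lower, upper) from equally spaced mid values."""
    h = mids[1] - mids[0]  # class size = difference of successive mid values
    return [(m - h / 2, m + h / 2) for m in mids]

mid_points = [5, 10, 15, 20, 25, 30]
intervals = intervals_from_midpoints(mid_points)
print(intervals)
```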

 

 

Histogram with unequal width classes

If the class intervals under consideration are not all of equal width, the frequencies must be adjusted before the histogram is drawn: each frequency is divided by the number of times its class width contains the smallest class width. The heights of the rectangles are then proportional to the ratio of the class frequency to the corresponding class size, and the areas of the rectangles remain proportional to the class frequencies.

  Example 5

Represent the following daily wage distribution of 125 workers by means of histogram.

Wage (in Rs.):    10-20   20-30   30-40   40-50   50-70   70-110

No. of workers:   8       19      28      18      20      20

Solution:

This is a case of unequal class intervals, so the frequencies must be adjusted. The class size of the first four class intervals is 10; that of the fifth is 20, which is twice 10; and that of the sixth is 40, which is four times 10. The corresponding frequencies are adjusted as follows:

    Calculation of adjusted frequency

Wage (in Rs.)    No. of workers (f)    Magnitude of class    Adjusted frequency (or height of histogram)
10-20            8                     10                    8
20-30            19                    10                    19
30-40            28                    10                    28
40-50            18                    10                    18
50-70            20                    20                    20/2 = 10
70-110           20                    40                    20/4 = 5
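The adjustment of frequencies for unequal class widths can be sketched in Python as below. The function name `adjusted_heights` is illustrative; it takes the smallest class width as the base width, as in the solution above.

```python
def adjusted_heights(classes, freqs):
    """Heights of histogram bars when class widths are unequal.

    Each frequency is divided by (class width / smallest class width).
    """
    widths = [upper - lower for lower, upper in classes]
    base = min(widths)
    return [f / (w / base) for f, w in zip(freqs, widths)]

wage_classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 70), (70, 110)]
workers = [8, 19, 28, 18, 20, 20]
heights = adjusted_heights(wage_classes, workers)
print(heights)
```

With these heights, the area of each bar (height × width) stays proportional to the original class frequency.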

 

 

 

[Figure: histogram with unequal width classes; X-axis: Wages (Rs)]

Frequency Polygon

Frequency polygon is another method of graphical presentation of a frequency distribution. A frequency polygon is generally constructed by joining the mid-points of the tops of the adjacent rectangles of a histogram by straight lines. A frequency polygon can also be drawn without a histogram. In this case, it is constructed by taking the mid-point of each class interval along the X-axis and the corresponding frequency along the vertical Y-axis, and joining the plotted points by straight lines.

 

   Example 6

Prepare a histogram and frequency polygon from the following data:

Class interval:   0-5   5-10   10-15   15-25   25-30   30-35

Frequency:        4     8      15      20      12      2

Solution: 

Here, the widths of the classes are unequal: they are 5 and 10. The width of the class 15-25 is 10, which is 2 times the width of the other classes. So the frequency of 15-25 is divided by 2, i.e. 20/2 = 10, which is the height of the rectangle for the class 15-25. The heights of the rectangles for the other classes are the same as their frequencies.

[Figure: histogram and frequency polygon; X-axis: Class interval]

 

  Example 7

For the following frequency distribution, plot a frequency polygon without histogram.

Class:       15.5-20.5   20.5-25.5   25.5-30.5   30.5-35.5   35.5-40.5

Frequency:   2           7           14          5           3

Solution:

Class:        15.5-20.5   20.5-25.5   25.5-30.5   30.5-35.5   35.5-40.5

Frequency:    2           7           14          5           3

Mid. value:   18          23          28          33          38

[Figure: frequency polygon without histogram; X-axis: Mid. value]

Frequency Curve

A smooth freehand curve drawn through the vertices of a frequency polygon is known as the frequency curve. The frequency polygon is smoothed in such a way that the area enclosed by the frequency curve is the same as that enclosed by the frequency polygon.

Example 1

Construct a histogram, frequency polygon and frequency curve from the following data:

Class interval:   0-5   5-10   10-15   15-25   25-30   30-35

Frequency:        8     16     30      40      24      4

 

Solution: 

Here, the widths of the classes are unequal: they are 5 and 10. The width of the class 15-25 is 10, which is 2 times the width of the other classes. So the frequency of 15-25 is divided by 2, i.e. 40/2 = 20, which is the height of the rectangle for the class 15-25. The heights of the rectangles for the other classes are the same as their frequencies.

[Figure: histogram, frequency polygon and frequency curve; X-axis: Class interval]

Cumulative Frequency Curve (or ogive)

Cumulative frequency curve is a graphical presentation of a cumulative frequency distribution. It is also known as an “Ogive”. In drawing an ogive, the points are plotted with the cumulative frequency (c.f.) along the Y-axis and the corresponding class limits along the X-axis, and are then joined by a smooth freehand curve. A relative cumulative frequency curve can also be drawn by taking the relative cumulative frequencies along the Y-axis and the variable values along the X-axis. Similarly, a cumulative percentage frequency curve can be drawn by taking the cumulative percentage frequencies along the vertical Y-axis.

Cumulative frequency curves are helpful in locating the partition values, viz. the median, quartiles, deciles and percentiles.

There are two types of cumulative frequency curve (or ogive), which are as follows.

(i)  Less than cumulative frequency curve (or less than ogive)

(ii)  More than cumulative frequency curve (or more than ogive)

Less than Cumulative Frequency Curve

This is a cumulative frequency curve obtained by plotting the upper limit of each class interval along the horizontal X-axis and the corresponding less than cumulative frequency along the Y-axis. A less than relative cumulative frequency curve (or less than cumulative percentage frequency curve) can also be drawn in a similar fashion.

Example 1

From the data given below, construct the following:

(a)           Less than cumulative frequency curve.

(b)           Less than relative cumulative frequency curve.

(c)           Less than cumulative percentage curve.

Period wise earnings of teacher (in Rs.)    No. of teachers (f)
100-200                                     5
200-300                                     10
300-400                                     22
400-500                                     30
500-600                                     16
600-700                                     12
700-800                                     15
Total                                       110

Solution

(a)         Calculation for less than cumulative frequency curve.

Earnings (in Rs.)    No. of teachers (less than cumulative frequency)
Less than 200        5
Less than 300        15
Less than 400        37
Less than 500        67
Less than 600        83
Less than 700        95
Less than 800        110

[Figure: less than ogive (less than cumulative frequency curve); X-axis: Period wise earning (Rs)]

(b)         Calculation for less than relative cumulative frequency curve

Earnings (in Rs.)    No. of teachers (less than c.f.)    Less than relative c.f.
Less than 200        5                                   0.05
Less than 300        15                                  0.14
Less than 400        37                                  0.34
Less than 500        67                                  0.61
Less than 600        83                                  0.75
Less than 700        95                                  0.86
Less than 800        110                                 1

 

 

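[Figure: less than relative cumulative frequency curve; X-axis: Weekly wage (Rs)]

(c)         Calculation for less than cumulative percentage frequency curve.

Each less than cumulative frequency is expressed as a percentage of the total number of teachers (110) and plotted against the corresponding upper limit.

[Figure: less than cumulative percentage frequency curve; X-axis: Weekly wage (Rs)]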
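The three sets of values needed for parts (a), (b) and (c) can be computed together, as in the following Python sketch. The variable names are illustrative; relative frequencies are rounded to two decimals and percentages to one decimal, which is an assumption about the rounding rather than something stated in the example.

```python
from itertools import accumulate

upper_limits = [200, 300, 400, 500, 600, 700, 800]
teachers = [5, 10, 22, 30, 16, 12, 15]

# (a) less than cumulative frequencies
less_than_cf = list(accumulate(teachers))
total = less_than_cf[-1]

# (b) less than relative cumulative frequencies
relative_cf = [round(c / total, 2) for c in less_than_cf]

# (c) less than cumulative percentage frequencies
percent_cf = [round(100 * c / total, 1) for c in less_than_cf]

for limit, c, r, p in zip(upper_limits, less_than_cf, relative_cf, percent_cf):
    print(f"Less than {limit}: c.f. = {c}, relative = {r}, percent = {p}")
```

Each curve is obtained by plotting one of these columns against the upper class limits.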


Example 2

Draw a less than ogive from the following data. Obtain the number of workers whose wages lie between Rs. 600 and Rs. 700.

Weekly wage (Rs)  500-550    550-600       600-650     650-700     700-750

No. of workers             20             35                  45              30                 25

Solution: 

Weekly wage (Rs.)    No. of workers (less than c.f.)
Less than 550        20
Less than 600        55
Less than 650        100
Less than 700        130
Less than 750        155

[Figure: less than ogive curve (less than c.f. curve)]

From the ogive, the number of workers earning less than Rs. 700 is 130 and the number earning less than Rs. 600 is 55. Hence the number of workers whose wages lie between Rs. 600 and Rs. 700 is 130 − 55 = 75.
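Reading a value from the ogive amounts to interpolating between the plotted points. The sketch below assumes straight-line segments between successive points, which is an approximation to a freehand curve, and the function name `less_than_at` is illustrative.

```python
def less_than_at(x, limits, cf):
    """Number of observations below x, read from a less than ogive.

    limits: class boundaries, starting with the lowest lower limit.
    cf: cumulative frequency at each boundary (0 at the first one).
    """
    points = list(zip(limits, cf))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            # Linear interpolation within the segment.
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the range of the ogive")

limits = [500, 550, 600, 650, 700, 750]
cf = [0, 20, 55, 100, 130, 155]

between = less_than_at(700, limits, cf) - less_than_at(600, limits, cf)
print(between)
```

The same function can be used for values that are not class limits, for example `less_than_at(625, limits, cf)`.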

 

More than Cumulative Frequency Curve (or more than ogive)

This is another cumulative frequency curve, obtained by plotting the lower limit of each class interval along the horizontal X-axis and the corresponding more than cumulative frequency along the vertical Y-axis. A more than relative cumulative frequency curve (or more than cumulative percentage frequency curve) can also be drawn in a similar manner, taking the more than relative cumulative frequency (or more than cumulative percentage frequency) along the vertical Y-axis and the lower limits of the class intervals along the X-axis. Note that the foot of the perpendicular drawn to the X-axis from the point of intersection of the less than ogive and the more than ogive gives the median value.

Example 1

Given the following distribution, plot more than cumulative frequency curve.

Class:       40-50   50-60   60-70   70-80   80-90   90-100

Frequency:   4       8       4       24      6       4

Solution

         Calculation for more than cumulative frequency curve

Class           More than cumulative frequency
More than 40    50
More than 50    46
More than 60    38
More than 70    34
More than 80    10
More than 90    4
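The more than cumulative frequencies are obtained by adding the frequencies from the last class backwards. A minimal Python sketch, with illustrative variable names:

```python
from itertools import accumulate

lower_limits = [40, 50, 60, 70, 80, 90]
freqs = [4, 8, 4, 24, 6, 4]

# Accumulate from the last class to the first, then restore the original order.
more_than_cf = list(accumulate(reversed(freqs)))[::-1]

for limit, c in zip(lower_limits, more_than_cf):
    print(f"More than {limit}: {c}")
```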

 

 

 

 

 

[Figure: more than ogive curve]

 

 

 

 

 

 

 

 

 

Example 2

The following data represent the number of days during a year that the faculty members of a college were absent from work. Plot the less than ogive and the more than ogive on the same plane and hence locate the median value.

Number of days    Number of employees
0-3               5
3-6               10
6-9               20
9-12              10
12-15             5
Total             50

Solution

         Calculation for less than ogive and more than ogive.

Number of days    No. of employees (less than c.f.)    No. of employees (more than c.f.)
Less than 3       5                                    50
Less than 6       15                                   45
Less than 9       35                                   35
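The median located by the intersection of the two ogives can be checked numerically. The sketch below assumes both ogives are drawn with straight-line segments between class boundaries; the function name `median_from_ogives` is illustrative.

```python
from itertools import accumulate

bounds = [0, 3, 6, 9, 12, 15]
employees = [5, 10, 20, 10, 5]

# Cumulative counts at each class boundary.
less = [0] + list(accumulate(employees))   # fewer than each boundary
total = less[-1]
more = [total - c for c in less]           # at least each boundary

def median_from_ogives(bounds, less, more):
    """X-coordinate where the less than and more than ogives intersect."""
    for i in range(len(bounds) - 1):
        d0 = less[i] - more[i]
        d1 = less[i + 1] - more[i + 1]
        if d0 <= 0 <= d1 and d1 != d0:
            # The difference changes sign in this class; interpolate linearly.
            return bounds[i] - d0 * (bounds[i + 1] - bounds[i]) / (d1 - d0)
    raise ValueError("the ogives do not intersect")

median = median_from_ogives(bounds, less, more)
print(f"Median = {median} days")
```

The two curves cross at the height N/2 = 25, inside the class 6-9, which is the median class.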

 
