Basic Concept of Statistics:
The word “Statistics” is derived from the Latin word “Status”, the French word “Statistique”, the Italian word “Statista” and the German word “Statistik”, each of which means 'political state'. States originally used Statistics only to keep information for government purposes, such as the population, the ‘poverty or wealth’ of the country, the number of police and soldiers, fiscal policies etc. Statistics was therefore regarded as the science of statecraft and was used mainly for the collection of information.
The word “Statistics” is used in the plural as well as the singular sense:
(i) Quantitative information about facts, or simply data.
(ii) Statistical methods for handling numerical data.
The first is the plural sense: Statistics means the quantitative information, or the set of numerical facts, collected systematically. The second is the singular sense: Statistics means the statistical methods and techniques adopted for the collection, presentation, analysis and interpretation of numerical data.
Definition of Statistics
There is not
a single definition of Statistics since the field of applicability has been
widening day by day. As the field of applicability is extended, the definition
of Statistics needs to be modified. Hence, it has been defined in a wider
domain. Some comprehensive definitions are as follows.
Definition of Statistics in Singular sense
In the
singular sense, it is also known as 'statistical method’ for the collection,
presentation, analysis and interpretation of numerical data.
The most
comprehensive definition given by Croxton and Cowden is:
"Statistics
may be defined as the science of the collection, presentation, analysis and
interpretation of numerical data."
Definition of Statistics in Plural sense
In the plural sense, of all definitions Prof. Horace Secrist’s definition is the most comprehensive. According to him:
"Statistics
are aggregates of facts affected to a marked extent by multiplicity of causes,
numerically expressed, enumerated or estimated according to reasonable standard
of accuracy, collected in a systematic manner for a predetermined purpose and
placed in relation to each other."
Statistics may also be classified into two parts, which are as follows.
• Theoretical Statistics or Mathematical Statistics
• Applied Statistics
Theoretical Statistics or Mathematical Statistics
Theoretical Statistics can further be subdivided into two parts.
• Descriptive Statistics
• Inferential Statistics
Descriptive Statistics
Descriptive Statistics merely describes the data. It consists of the methods and techniques used in the collection, organization and presentation of data (using tables, charts, diagrams, graphs etc.), in summarizing data (using measures of central tendency, dispersion, skewness, kurtosis etc.) and in analysing data in order to describe their various features and characteristics. The summarized results obtained from descriptive statistics describe the data at hand but cannot, by themselves, be generalized beyond them. Averages, rates, ratios, percentages etc. are the basic summary (descriptive) statistics, chosen according to the nature of the data. The most frequently used descriptive measures are measures of central tendency, dispersion, skewness and kurtosis.
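The summary measures named above can be illustrated with a short Python sketch. The pulse-rate values are hypothetical and only serve to show the calculation; the standard deviation used is the sample standard deviation.

```python
import statistics

# Hypothetical sample: pulse rates (beats per minute) of ten patients.
pulse_rates = [72, 75, 68, 80, 72, 90, 66, 72, 78, 84]

summary = {
    "mean": statistics.mean(pulse_rates),        # central tendency
    "median": statistics.median(pulse_rates),    # central tendency
    "mode": statistics.mode(pulse_rates),        # most frequent value
    "range": max(pulse_rates) - min(pulse_rates),  # dispersion
    "sample_sd": statistics.stdev(pulse_rates),    # dispersion
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

These figures only describe the ten observations themselves; nothing is claimed about any larger group.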
Inferential Statistics
Inferential
Statistics deals with the methods of drawing ( or inferring ) conclusions about
the characteristics of the population based upon the results of the sample
taken from the population.
In other
words, Statistics which deals with sample selection from population and
statistical technique used to draw conclusion about population on the basis of
statistical measures obtained from sample is called inferential statistics.
It is used
in estimation of parameters and statistical testing of hypothesis.
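As a rough illustration of estimation, the sketch below draws a sample from a simulated population and estimates the population mean with an approximate 95% interval. The population, its parameters (165 cm, 8 cm) and the sample size are assumptions made up for this example, and the normal (Z) approximation is used only because the sample is fairly large.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 10,000 people.
population = [random.gauss(165, 8) for _ in range(10_000)]
sample = random.sample(population, 100)   # sample taken from the population

sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96 for 95% confidence

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(f"estimate: {sample_mean:.1f} cm, 95% interval: ({lower:.1f}, {upper:.1f})")
```

The conclusion about the population (its mean height) is drawn only from the 100 sampled values, which is the essence of inference.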
Applied Statistics
Applied Statistics consists of the application of theoretical or mathematical Statistics in different areas such as biology, physics, engineering, astronomy, metrology, chemistry, medical science, sociology, psychology, business, economics, agriculture, econometrics, bio-statistics (or biometry) etc. Statistical tools and methods are used to solve practical problems in these diverse areas. Applied statistics is also used in decision-making problems.
Functions of Statistics
Statistics plays a vital role in almost every field of study. Some of its important functions are listed below:
• It helps in the classification of data.
• It simplifies complexity.
• It facilitates comparison.
• It presents facts in a definite form.
• It helps determine the relationship between different phenomena.
• It helps in the formulation of policies.
• It helps in predicting or forecasting future trends.
• It helps in formulating and testing hypotheses.
• It helps to draw valid inferences or conclusions.
Importance and Scope of Statistics
In recent years, the development of statistical studies has considerably increased its scope and importance. There is hardly any walk of life that has not been affected by statistics, and it has become indispensable in almost all areas. It is used in the physical sciences, biological sciences, medical sciences, economics, social sciences, management sciences, information technology, engineering, agriculture, industry, insurance, business, accounting, finance, marketing, production and many other areas.
Application Of Statistics In Engineering
Statistics is an important tool for robustness analysis, measurement system error analysis, test data analysis, probabilistic risk assessment, and many other fields in the engineering world. The key to success in teaching statistics to engineering students is to make the statistical solution relevant to the engineering problems they face.
Probability and
statistics are used throughout engineering. In electrical engineering, signals
and noise are analyzed by means of probability theory. Civil, mechanical, and
industrial engineers use statistics and probability to test and account for
variations in materials and goods. Chemical engineers use probability and
statistics to assess experimental data and control and improve chemical
processes. It is essential for today’s engineer to master these tools.
Variable
In Statistics, a variable (or characteristic) is an attribute that describes a person, place, thing or idea under study. A characteristic or measurement that may vary from one individual to another, from place to place or from time to time is called a variable. In other words, a quantitative or qualitative characteristic that varies from observation to observation in the same group is called a variable. For example, the variable may be height, weight, age, blood pressure, pulse rate, blood sugar, temperature, gender, ethnicity, religion, occupation, hair colour, eye colour, knowledge, nationality, pain etc. Generally, variables are denoted by X, Y or Z.
Types of variable
Statistical data can be divided into two broad categories, so there are two types of variables:
1. Categorical variable or Qualitative variable
2. Numerical variable or Quantitative variable
Qualitative variable
A qualitative variable is a variable or characteristic which cannot be measured in quantitative form (i.e. as a number) but can only be identified by name or category. It is also called a categorical variable, because it takes a value that is one of several possible categories. As naturally measured, categorical variables have no numerical meaning. The data are classified by counting the individuals having the same characteristic or attribute, not by measurement. Examples: hair colour (black, grey, white), gender (male/female), religion, nationality, marital status, disease (present/absent), eye colour (black, brown, blue), vaccine (vaccinated/not vaccinated), smoking (smoking/not smoking), stage of cancer (I, II, III or IV), blood group (A, B, AB, O), degree of pain (minimal, moderate, severe or unbearable) etc.
A qualitative variable with only two categories is called a binary or dichotomous variable, such as male/female, married/single, alive/dead etc. Unordered categorical data of this type are called nominal data. Ordered categorical data, such as pain (mild/moderate/severe), cigarette consumption (non-smoker/light smoker/heavy smoker) and socioeconomic status (lower, middle, upper), are called ordinal data.
A qualitative variable may be further classified into two categories:
(i) Nominal variable
(ii) Ordinal variable
Nominal variable
Qualitative variables which can be divided into categories such that the numbers or symbols 0, 1, 2 etc. assigned to them have no numerical meaning are called nominal variables. The codes only distinguish one category from another; the differences between the code numbers mean nothing. They do not satisfy ordinary arithmetic operations such as addition, subtraction, multiplication, division etc.
For example, gender (male/female), marital status (married/single), profession (teacher, doctor, engineer, officer etc.), religion (Hindu, Muslim, Buddhist etc.), hair colour (black, grey, white) etc.
Ordinal variable: Qualitative variables whose categories can be ordered or ranked are called ordinal variables. For example, pain: mild/moderate/severe; cigarette consumption: non-smoker/light smoker/heavy smoker; socioeconomic status: lower, middle, upper; attitude towards a certain fact: positive, neutral, negative etc.
Quantitative variable: A quantitative variable is one that can be measured and expressed numerically. A quantitative variable therefore describes a characteristic in terms of a numerical value, expressed in units of measurement. For example, height, weight, age, income, pulse rate, blood pressure, level of hemoglobin in blood, number of children in a family, temperature records etc. These data may be represented on interval and ratio scales. Quantitative variables can be divided into two types based on the nature of the characteristic:
1. Discrete variable
2. Continuous variable
Discrete variable: A variable is said to be discrete if it takes only countably many values (typically whole numbers). It is a variable whose values are obtained by counting.
For example,
No. of goals scored in a football match.
No. of children in a family: 1, 2, 3, 4, 6
WBC count: 2000, 3010, 4060, 5050
Continuous variable: A variable is said to be continuous if it can take any real value (whole numbers as well as fractional values) within a certain range. It is a variable whose values are obtained by measuring. For example: height, weight, age, income, pulse rate, blood pressure, level of hemoglobin in blood, cholesterol etc.
e.g.
Weight of children (in kg): 8, 9, 9.5, 9.8, 10, 10.3
Height of patients (in feet): 4.6, 4.9, 5, 5.4, 5.6, 5.8, 6
Why it is important to know what type of variable is being measured
Categorical (qualitative) variables are generally analyzed using non-parametric tests (those that do not assume a normal distribution). Examples of non-parametric tests are the Chi-square test, Fisher’s exact test, Wilcoxon matched-pairs test (signed-rank test), Mann-Whitney test (U test), Kruskal-Wallis test (H test), Sign test, McNemar test, Median test etc.
Quantitative variables (more specifically, continuous variables) are generally analyzed using parametric tests (those that assume a normal distribution). Examples of parametric tests are the Z test, Student’s t test (or t test) and F test (or variance ratio test).
Measurement and measurement scales
Measurement may be defined as the assignment of numbers to objects or events according to certain rules. Measurement of statistical data is essential for further statistical analysis, and the choice of an appropriate statistical technique depends upon the type of data. Data are always measured on one of four scales: nominal, ordinal, interval or ratio. Qualitative data are measured on nominal and ordinal scales, whereas quantitative data are measured on interval and ratio scales. The four types of measurement scale are as follows.
1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale
Nominal scale
This is the simplest and lowest measurement scale. As its name suggests, it consists of assigning numbers or symbols to events just to classify them into categories, so that the numbers assigned have no numerical meaning. The codes only distinguish one category from another; the differences between the code numbers mean nothing. They do not satisfy ordinary arithmetic operations such as addition, subtraction, multiplication, division etc.
For example,
Sex: male, female
Religion: Hindu, Buddhist, Muslim, Christian
Occupation: Teacher, Manager, Doctor, Businessman, Civil service
All categories are mutually exclusive. One cannot rank them, and cannot say that male is a higher value than female or that teacher is a higher value than manager. An important feature is that no category is in any sense better or worse than another; they are simply different from one another. Symbols or numeric values such as 0, 1, 2 etc. are used just to identify the categories and have no other meaning.
Statistical measurement: The descriptive statistics associated with nominal data are frequencies, percentages, proportions or probabilities. Frequency tables, bar charts and pie charts are used for presentation. The mode is the only measure of central tendency that can be used: nominal data cannot be averaged, and only the number in each category can be counted. Measures such as the mean and standard deviation cannot be used. For inferential statistics, the Chi-square test is the most common test applicable to nominal data; the McNemar test is also used.
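The frequency, percentage and mode calculations for nominal data can be sketched in Python as follows. The blood-group observations are hypothetical.

```python
from collections import Counter

# Hypothetical blood groups of 12 patients (nominal data).
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A", "O", "B"]

counts = Counter(blood_groups)           # frequency of each category
total = len(blood_groups)
# Percentage of observations falling in each category.
percentages = {group: 100 * n / total for group, n in counts.items()}
# The mode is the most frequent category.
mode, mode_count = counts.most_common(1)[0]

for group, n in counts.most_common():
    print(f"{group:>2}: {n} ({percentages[group]:.1f}%)")
print("mode:", mode)
```

Only counting is involved; no arithmetic is ever done on the category labels themselves.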
Ordinal scale
The ordinal scale is slightly more sophisticated than the nominal scale. Observations are not only different from category to category: when they can also be ranked or ordered according to some criterion, priority or importance, they are said to be measured on an ordinal scale. On this scale the categories are ordered but the differences between them cannot be determined, i.e. the distance between objects or ranks cannot be measured. Ordinal scales use numbers, letters or other symbols to rank items. Simple arithmetic operations are not meaningful for ordinal data, but ordinal data can lie on a continuum such as excellent, good, fair and poor. Examples: symptoms of depression from a psychiatric assessment, coded as none ‘0’, mild ‘1’, moderate ‘2’, severe ‘3’; socioeconomic status: upper, middle, lower; pain: mild, moderate, severe; stage of cancer: 1, 2, 3, 4; preference ranking of different banks such as NMB, Standard Chartered and NABIL; Likert scale items.
Statistical measurement: For ordinal data, frequency tables, bar charts and pie charts are used. The median is an appropriate measure of central tendency; the mode, frequencies and percentages can also be used. Percentiles, quartiles, range, maximum and minimum are used to describe dispersion; the standard deviation and variance are not strictly appropriate, since distances between ranks are not defined. For correlation, rank-order correlation, i.e. Spearman’s rank correlation coefficient, can be used. Because of the nature of the data, non-parametric tests such as the Wilcoxon signed-rank test, Mann-Whitney U test and Kruskal-Wallis test are used.
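A minimal sketch of finding the median category of ordinal data is shown below. The pain ratings are hypothetical, and the numeric codes are used only to preserve the order of the categories, not as measurements.

```python
import statistics

# Order of the categories, from lowest to highest.
levels = ["mild", "moderate", "severe"]
rank = {level: i for i, level in enumerate(levels)}

# Hypothetical pain ratings reported by nine patients.
ratings = ["mild", "severe", "moderate", "mild", "moderate",
           "moderate", "severe", "mild", "moderate"]

# Replace each rating by its rank and sort the ranks.
codes = sorted(rank[r] for r in ratings)
# median_low always returns an observed rank, so it maps back to a category.
median_level = levels[statistics.median_low(codes)]
print("median category:", median_level)
```

Using `median_low` rather than `median` is a design choice here: with an even number of observations it avoids averaging two ranks, which would not correspond to any category.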
Interval scale
The interval scale is more sophisticated than the nominal and ordinal scales: not only can measurements be ordered, but the distance between any two measurements is also known. On this scale, numerals with quantitative meaning are assigned to the objects. The scale assumes that the data have equal intervals, so the distances (differences) between two values or observations can be measured. However, there is no inherent starting point: there is no absolute zero, and the zero point is fixed only by convention. Addition and subtraction can be used, but the ratio of two scores is not meaningful. Examples: IQ; the temperature of a body in degrees Celsius (°C) or Fahrenheit (°F). The difference between 20 °C (68 °F) and 25 °C (77 °F) is the same as the difference between 5 °C (41 °F) and 10 °C (50 °F). But the scale is not a ratio scale: it has equal intervals, yet one cannot say that 10 degrees is twice as hot as 5 degrees, because there is no absolute zero.
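The temperature example can be checked with a few lines of Python. The helper name `c_to_f` is introduced here only for illustration.

```python
def c_to_f(celsius: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Equal Celsius differences stay equal after conversion to Fahrenheit.
diff_a = c_to_f(25) - c_to_f(20)
diff_b = c_to_f(10) - c_to_f(5)
print(diff_a, diff_b)

# Ratios are not preserved, so "twice as hot" has no meaning on this scale.
ratio_c = 10 / 5
ratio_f = c_to_f(10) / c_to_f(5)
print(ratio_c, ratio_f)
```

The two differences agree, while the ratio changes with the unit chosen, which is what distinguishes an interval scale from a ratio scale.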
Statistical measurement: In analyzing interval data, the mean can be used to measure central tendency and the standard deviation is widely used for dispersion. For correlation, the product-moment correlation (i.e. Karl Pearson’s correlation coefficient) can be calculated, and for inferential statistics the Z test, t test and F test can be used for significance testing.
Ratio scale
This is the highest level of measurement and is used for quantitative data. It includes all the properties of the interval scale as well as an absolute zero value. The numbers on a ratio scale are true numbers with a true zero, so the ratio between two values is a meaningful measure, and operations such as multiplication and division are meaningful. Temperature measured in degrees Celsius or Fahrenheit is not a ratio variable because 0 °C does not mean there is no temperature. However, temperature measured in Kelvin is a ratio variable, as 0 Kelvin (often called absolute zero) is a true zero point. Other examples of ratio variables include height, mass, distance, weight etc.
The name "ratio" reflects the fact that ratios of measurements can be used. For example, a distance of 10 meters is twice a distance of 5 meters. Various types of statistical analysis and mathematical operations can be used on ratio data.
Statistical measurement: The geometric mean, harmonic mean, arithmetic mean, standard deviation, coefficient of variation, Karl Pearson’s correlation coefficient, regression etc. and most parametric statistical tests such as the t-test, Z-test and F-test can be used.
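Some of the measures listed for ratio data can be computed as in the sketch below. The weights are hypothetical, and the coefficient of variation is taken here as the sample standard deviation divided by the mean, expressed in percent.

```python
import statistics

# Hypothetical weights (kg) of eight children: a ratio-scale variable.
weights = [8.0, 9.0, 9.5, 9.8, 10.0, 10.3, 11.2, 12.0]

mean = statistics.mean(weights)
geometric = statistics.geometric_mean(weights)
harmonic = statistics.harmonic_mean(weights)
# Coefficient of variation: standard deviation relative to the mean, in percent.
cv = 100 * statistics.stdev(weights) / mean

print(f"mean={mean:.2f} GM={geometric:.2f} HM={harmonic:.2f} CV={cv:.1f}%")
# A true zero makes ratios meaningful: 12 kg is 1.5 times 8 kg.
print(weights[-1] / weights[0])
```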
Data collection
By the definition of Statistics, data are an aggregate of facts which can be expressed numerically, and data are one of the main sources of information. The first step of a statistical study is the collection of data, i.e. the process of getting the necessary information from the units under investigation.
There are mainly two types of data on the basis of collection procedure:
1. Primary data
2. Secondary data
Primary data
Primary data are fresh and original data, collected and recorded by the investigator or researcher. Data which are originally collected by an investigator, researcher or agent for the first time for the purpose of a statistical enquiry (or investigation) are known as primary data. They are first-hand, unique and original, and are generally reliable and accurate in character. Primary data are collected for the specific purpose of the investigator's or researcher's study.
Methods of collecting primary data
There are
various methods of collecting primary data.
1. Interview method
(a) Direct personal interview method
(b) Indirect personal interview method
(c) Telephone interview method
2. Information through correspondence
3. Mailed questionnaire method
4. Schedule sent through enumerators
5. Internet
6. Observation method.
Problems involved in collecting primary data
Various problems have to be faced while collecting primary data. The nature of the problems depends upon the situation. Some of the problems are as follows.
• Generally there is a lack of time and money while collecting primary data.
• Transportation facilities are essential for the enumerators or investigators to travel to and from the research field. This is one of the major problems for underdeveloped countries like Nepal.
• There is a high degree of non-response error from illiterate respondents.
Secondary Data
Secondary data are data which have already been collected and analyzed by some earlier agency for its own use, and are later used by a different agency. The data are primary for the persons or institutions that collect them, but the same data become secondary for others. In other words, secondary data are borrowed from others who collected them for some other purpose. This type of data is not original in nature, and its degree of accuracy is comparatively less than that of primary data.
Sources of Secondary Data
In most studies the investigator finds it impracticable to collect first-hand information on all related issues and so makes use of data collected by others. Secondary data may be obtained from the following sources.
1. Published sources, and
2. Unpublished sources
Published Sources
The various sources of published data are:
• Reports and publications of ministries and departments of the government.
• Reports and publications of reputed INGOs such as UNDP, ADB, UNESCO, WHO, World Bank etc.
• Reports and publications of reliable NGOs, NDHS (Nepal Demographic Health Survey), journals, periodicals etc.
• Publications brought out by research agencies, research scholars etc.
Unpublished Sources
Not all statistical material is published. There are various sources of unpublished data such as:
• Records maintained by various government and private offices.
• Records maintained by research institutions, research scholars etc.
• Records updated by departments and institutions for their internal purposes.
Merits of secondary data
• It saves time and cost.
• If specially trained persons collected it, the quality of secondary data is better.
• The scope of inquiry can be increased in terms of the area and time period covered.
• Much of the available secondary data has been collected over many years and can therefore be used to plot trends.
• It helps the government in making decisions and planning future policy.
Demerits of secondary data
• Data may not be in the exact form required by the researcher.
• Some information is often omitted, or some categories are pooled.
Difference between Primary and Secondary data
The difference between primary and secondary data depends basically on the mode of collection. Data which are primary for one agency are secondary for another. The main differences between primary and secondary data are as follows:
Primary data | Secondary data |
Primary data are original in the sense that they are personally collected by the investigator or researcher. | Secondary data are not original in the sense that they are collected by someone other than the investigator or researcher. |
Primary data collection is more expensive and exhaustive. | Secondary data are readily available at less expense. |
Primary data are collected as per the requirement of the investigator. | Secondary data might have been collected with different objectives. |
Primary data may be influenced by the personal prejudice of the investigator. | Secondary data may not be influenced by the personal prejudice of the investigator. |
Presentation and Classification of Data: Stem-and-Leaf Display
Classification
Data collected from primary sources are in a raw form: a voluminous, unorganized mass of figures not fit for any statistical treatment (i.e. analysis and interpretation). These unwieldy, ungrouped and shapeless masses of collected data are not easy to handle and cannot be interpreted directly. So, after collecting and editing the data, the next step is to condense, organize and simplify them in such a way that irrelevant details are eliminated and their significant features stand out prominently. The appropriate procedure for this is the classification and presentation of data.
Classification of data:
Classification
is the process of arranging the primary data into different groups or classes
according to their similarities. Classification of data helps to change the
voluminous heterogeneous data into homogenous groups so that its important
characteristics can easily be grasped. It also makes the data more useful for
further processing such as tabulation, analysis and interpretation. Thus,
Classification is the first step in tabulation because the items with
similarities must be brought together before the data are presented in the form
of a table.
Objectives of classification
The main objectives of classification are:
• It makes the data more easily understood.
• It presents the facts in a simpler form.
• It condenses the huge mass of data by eliminating irrelevant details so that similarities and dissimilarities can easily be brought out.
• It facilitates quick comparison.
• It helps in further statistical treatment such as computing averages, dispersion etc.
• It helps to detect errors easily.
• It prepares the basis for tabulation.
Types of classification
The collected data can be classified or grouped as follows:
• Geographical classification
• Chronological classification
• Qualitative classification
• Quantitative classification
Geographical classification
In this classification, the data are classified according to place, area, region etc. For example, the population density in different regions of Nepal:
Region | Population Density (per sq. km.) |
Eastern | 188 |
Central | 293 |
Western | 155 |
Mid-western | 71 |
Far-western | 112 |
Chronological classification
In this type of classification, the data are classified according to time. Time series data are good examples of chronological classification. E.g.
Population Growth rate
Year | Growth Rate (in percentage) |
1961 | 2.10 |
1971 | 2.07 |
1981 | 2.24 |
1991 | 2.66 |
2001 | 1.65 |
Qualitative classification
Classification is said to be qualitative when the data are classified on the basis of some attribute, quality or descriptive characteristic which cannot be described numerically. Such data are known as categorical or qualitative data. For example sex, nationality, honesty, colour of eye, religion etc.
A classification on one attribute into two sub-divisions is known as simple or 'two-fold' classification. If more than one attribute is studied simultaneously, the data are divided into a number of classes, and this classification is known as 'manifold' classification.
Quantitative classification
Classification is said to be quantitative when the data are classified according to some characteristic which can be measured numerically. Such data are known as numerical or quantitative data. Height, weight, age, profit, income, pulse rate, sales, turnover, blood pressure etc. are some examples of this type of data. E.g.
Height in feet | No. of persons |
4 | 5 |
4.8 | 10 |
5 | 7 |
5.5 | 18 |
5.10 | 6 |
6 | 3 |
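Quantitative classification into class intervals can be sketched in Python as below. The function name `frequency_distribution`, the ages and the class width of 10 are all assumptions made for this illustration; each class includes its lower limit and excludes its upper limit, and classes with no observations are simply not listed.

```python
from collections import Counter

def frequency_distribution(values, start, width):
    """Count values in classes [start, start+width), [start+width, start+2*width), ..."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical ages (in years) of 12 persons.
ages = [23, 35, 41, 28, 39, 52, 47, 31, 26, 44, 58, 33]
for (lower, upper), f in frequency_distribution(ages, 20, 10).items():
    print(f"{lower}-{upper}: {f}")
```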
Stem-and-Leaf Display
A stem-and-leaf display (or stem-and-leaf plot) is another graphical device that is useful for representing quantitative data sets. It became more commonly used in the 1980s, after the publication of John Tukey’s book Exploratory Data Analysis (EDA) in 1977.
To construct a stem-and-leaf display, the observations are first arranged in ascending order (i.e. as an ordered array). The basic idea of the display is based on the analogy of plants: each measurement is partitioned into two parts, the first called the stem and the second the leaf. The leading digit or digits (the greatest common place value of the data) form the stem, and the digit in the next place forms the leaf. A stem-and-leaf display bears a strong resemblance to a histogram: it shows how the values are distributed and where the data are concentrated over the range of observations, and so gives an idea of the shape of the distribution of the given data values.
Example 1
The following are the weights (in kg) of 40 students. Construct the stem-and-leaf display from the data.
85, 50, 48, 45, 54, 52, 54, 77,
63, 65, 85, 86, 87, 66, 67, 81,
82, 83, 66, 68, 67, 68, 69, 55,
56, 91, 92, 66, 68, 69, 84, 45,
47, 55, 77, 55, 54, 71, 72, 73
Solution
Arranging the given data in ascending order of magnitude (i.e. as an ordered array):
45, 45, 47, 48, 50, 52, 54, 54,
54, 55, 55, 55, 56, 63, 65, 66,
66, 66, 67, 67, 68, 68, 68, 69,
69, 71, 72, 73, 77, 77, 81, 82,
83, 84, 85, 85, 86, 87, 91, 92
The stem-and-leaf display is as follows.
Stem | Leaves |
4 | 5 5 7 8 |
5 | 0 2 4 4 4 5 5 5 6 |
6 | 3 5 6 6 6 7 7 8 8 8 9 9 |
7 | 1 2 3 7 7 |
8 | 1 2 3 4 5 5 6 7 |
9 | 1 2 |
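The construction used in Example 1 can be sketched in Python. The function name `stem_and_leaf` is chosen for this illustration, the weights are the 40 values listed in the example, and the sketch assumes two-digit whole numbers so that the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit whole numbers by tens digit (stem) and units digit (leaf)."""
    display = defaultdict(list)
    for value in sorted(values):           # ordered array first
        stem, leaf = divmod(value, 10)     # e.g. 85 -> stem 8, leaf 5
        display[stem].append(leaf)
    return dict(display)

weights = [85, 50, 48, 45, 54, 52, 54, 77, 63, 65, 85, 86, 87, 66, 67, 81,
           82, 83, 66, 68, 67, 68, 69, 55, 56, 91, 92, 66, 68, 69, 84, 45,
           47, 55, 77, 55, 54, 71, 72, 73]

for stem, leaves in stem_and_leaf(weights).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Because the values are sorted before being split, the stems appear in ascending order and the leaves within each stem are already ordered.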
Example 2
The following data represent the annual expenses (in thousands of rupees) on food items in a city.
13.8, 14.1, 14.7, 15.2, 12.8, 15.6, 14.9, 16.7, 19.2,
14.9, 14.9, 14.9, 15.2, 15.9, 15.2, 14.8, 19.1, 14.6,
18.0, 14.9, 14.2, 14.1, 15.3, 15.3, 15.5, 18.0, 17.2,
17.2, 14.1, 14.5, 18.0, 14.4, 14.2, 14.6, 14.2, 14.8
a. Place the data into an ordered array.
b. Construct the stem-and-leaf display.
c. Which of these two displays seems to provide more information? Discuss.
Solution:
(a) Arranging the given data in an ordered array (i.e. in ascending order):
12.8, 13.8, 14.1, 14.1, 14.1, 14.2, 14.2, 14.2, 14.4,
14.5, 14.6, 14.6, 14.7, 14.8, 14.8, 14.9, 14.9, 14.9,
14.9, 14.9, 15.2, 15.2, 15.2, 15.3, 15.3, 15.5, 15.6,
15.9, 16.7, 17.2, 17.2, 18.0, 18.0, 18.0, 19.1, 19.2
(b) The stem-and-leaf display is as follows.
Stem | Leaves (0.1) |
12 | 8 |
13 | 8 |
14 | 1 1 1 2 2 2 4 5 6 6 7 8 8 9 9 9 9 9 |
15 | 2 2 2 3 3 5 6 9 |
16 | 7 |
17 | 2 2 |
18 | 0 0 0 |
19 | 1 2 |
(c) An ordered array only shows the values in order from the smallest to the largest. The stem-and-leaf display provides two kinds of information:
i) How the data values are distributed, i.e. lowest annual expense = Rs. 12.8 (in thousands) and largest annual expense = Rs. 19.2 (in thousands).
ii) Where the data are concentrated, and hence the shape of the distribution: most of the annual expenses lie between Rs. 14.1 and Rs. 14.9 (in thousands), and the most frequent value is Rs. 14.9 (in thousands), which occurs 5 times.
Therefore, the stem-and-leaf display provides more information than the ordered array.
Example 3:
Construct
stem and leaf display from the following data
5.35, 4.75,
4.30, 5.47, 4.85, 6.62, 3.54, 4.87, 6.26, 5.48, 7.27, 8.45, 6.05, 4.76, 5.91
Solution: Arranging the given data in ascending order:
3.54, 4.30, 4.75, 4.76, 4.85, 4.87, 5.35, 5.47, 5.48, 5.91, 6.05, 6.26, 6.62, 7.27, 8.45
Rounding each value to one decimal place, the data become:
3.5, 4.3, 4.8, 4.8, 4.9, 4.9, 5.4, 5.5, 5.5, 5.9, 6.1, 6.3, 6.6, 7.3, 8.5
Stem and leaf display
Stem | Leaves (0.1) |
3 | 5 |
4 | 3 8 8 9 9 |
5 | 4 5 5 9 |
6 | 1 3 6 |
7 | 3 |
8 | 5 |
Example 4
From the following stem-and-leaf display of the age (in years) of glaucoma patients treated at Tilganga Eye Institute, Kathmandu, construct a histogram using a class width of 10.
Stem | Leaves
4 | 0 2 3 4 6
5 | 0 1 2 5 5 7
6 | 0 1 2 3 5 5 5 7 8 9 9
7 | 1 2 4 5 8
8 | 0 4 7
a. What percentage of cases is below 60?
b. In which age group does the concentration of values lie?
c. Construct also the relative percentage frequency distribution.
d. What percentage of patients lies above 70?
Solution:
a) Total number of cases = 33
Number of cases below 60 = 11
Percentage of cases below 60 = (11 / 33) × 100 = 33.33 %
b) The concentration of values lies in the age group 60 to 69.
c) Relative percentage frequency = (f / N) × 100
Construction of relative percentage frequency distribution
Age (in years) | No. of glaucoma patients (f) | Relative percentage frequency
40-50 | 5 | 15.15
50-60 | 6 | 18.18
60-70 | 11 | 33.33
70-80 | 5 | 15.15
80-90 | 3 | 9.09
90-100 | 3 | 9.09
Total | N = 33 | ≈ 100
d) Number of patients aged 70 and above = 5 + 3 + 3 = 11
Percentage of patients aged 70 and above = (11 / 33) × 100 = 33.33 %
Variable or variate (Quantitative variable)
Any quantitative characteristic under study is known as a quantitative variable (or variate). Generally, variables are denoted by X, Y or Z. The values taken by the variable are known as variate values. Basically, there are two types of variables.
1. Discrete variable
2. Continuous variable
Discrete variable: A variable is said to be discrete if it takes only countably many values (typically whole-number values). For example: no. of goals in a football match, no. of children in a family, WBC count, no. of beds in hospital wards etc.
Continuous variable: A variable is said to be continuous if it can take any real value (whole numbers as well as fractional values) within a certain range or limits. For example: height, weight, age, income, pulse rate, blood pressure, level of hemoglobin in blood, cholesterol, temperature, marks obtained by students etc.
Raw Data
Raw data, also known as primary data, are data collected from a source. Raw data are arranged neither in ascending nor in descending order of magnitude. Data arranged in either ascending or descending order are known as arranged data or an “ordered array”.
Frequency
The number of times a variate value occurs is known as its frequency. It is generally denoted by f.
Tally Bars (or Tally Marks)
Vertical lines (|) used to record the number of repetitions of a variable value are known as tally bars or tally marks. They make it easier to count a large mass of repeated data.
Frequency Distribution
A table consisting of variable values along with their corresponding frequencies is known as a frequency distribution. It is a summary table in which the data (values of the variable) are listed with their corresponding frequencies. Example:
Variable | 15 | 20 | 10 | 25
Frequency | 2 | 7 | 5 | 3
Types of frequency distribution
A distribution having only one variable is called a univariate frequency distribution. A univariate frequency distribution can be further classified into three categories:
1. Individual series
2. Discrete series or discrete frequency distribution
3. Continuous series or continuous frequency distribution
Individual series
A series in which the items (i.e. values of the variable) are listed as observed, in either organized or unorganized form, is called an individual series.
Examples: (i) Marks obtained by a student in different subjects
Subject | Maths | Statistics | Account | Computer | Economics | Finance
Marks | 55 | 61 | 55 | 58 | 62 | 50
(ii) The marks obtained by 20 students in a subject are given below:
23, 25, 69, 15, 70, 80, 50, 30, 25, 23, 50, 90, 70, 70, 80, 25, 25, 85, 80, 23
Discrete series or Discrete frequency distribution
A series formed from a discrete variable along with the corresponding frequencies is known as a discrete series or discrete frequency distribution. For example:
No. of goals scored | 0 | 1 | 2 | 3 | 4
No. of matches | 5 | 3 | 8 | 4 | 2
Continuous series or Continuous frequency distribution or Grouped frequency distribution
A series formed from a continuous variable with the corresponding frequencies is known as a continuous series. It is also called a continuous frequency distribution or grouped frequency distribution. In this frequency distribution, the variable can take all possible real values (whole numbers as well as fractional values) within a certain range or limits. The widths of the class intervals may or may not be equal. For example:
Income (in ‘000’ Rs.) | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 | 80-90
No. of persons | 8 | 13 | 15 | 4 | 6 | 1
Ungrouped Frequency Distribution
A table consisting of variable values and their corresponding frequencies is known as an ungrouped frequency distribution. In this distribution, data are not classified (or grouped) into different classes or groups. For example:
Marks | 20 | 30 | 40 | 45
No. of students | 2 | 7 | 5 | 3
Grouped Frequency Distribution
Data classified into different classes or groups according to their characteristics are known as grouped data. Therefore, a table consisting of grouped or classified data and their corresponding frequencies is known as a grouped frequency distribution.
Example:
Income (in 000 Rs.) | 10-20 | 20-30 | 30-40 | 40-50
No. of employees (f) | 5 | 8 | 6 | 4
• Class size: The difference between the upper limit and the lower limit of a class interval is called the class size, width of class or magnitude of class interval.
• Class frequency: The frequency of each class interval is called the class frequency.
• Mid-point (mid-value) of class interval: It is defined as the average of the lower limit and the upper limit of a class interval. For example, referring to the above example,
Mid-point of the first class = (10 + 20) / 2 = 15.
Similarly, the others can be calculated.
General Guidelines for Classification of Data
There is no hard and fast rule for classifying data. However, the following points should be kept in mind.
• Classes should be clearly defined.
• The width of each class interval in the distribution should be equal as far as possible.
• As far as possible, open-ended classes should be avoided.
• The number of classes should be neither too large nor too small (generally 5 to 15). According to H.A. Sturges, the approximate number of classes (k) and the width of class (h) are calculated by
(i) Approximate no. of classes, k = 1 + 3.322 log n
(ii) Width of class, h = (L – S) / k
where n = number of observations, L = largest value, S = smallest value, and log is the logarithm to base 10.
• Classes should be non-overlapping.
• Class frequencies should be obtained by using tally bars or tally marks.
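Sturges' rule and the class width formula above can be checked with a short Python sketch. The function names `sturges_classes` and `class_width` are illustrative only, and rounding the number of classes up to the next whole number is an assumption; the rule itself is only a guide.

```python
import math

def sturges_classes(n):
    """Approximate number of classes, k = 1 + 3.322 log10(n)."""
    return 1 + 3.322 * math.log10(n)

def class_width(largest, smallest, k):
    """Class width h = (L - S) / k."""
    return (largest - smallest) / k

n, largest, smallest = 50, 205, 70       # 50 observations ranging from 70 to 205
k = math.ceil(sturges_classes(n))        # 6.64 is rounded up to 7 classes
h = class_width(largest, smallest, k)    # 135 / 7, about 19.3
print(k, round(h, 2))
```

In practice the width is then rounded to a convenient value (here 20) so that the class limits are easy to read.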
Example 1
If class size (h) is given:
The following are the marks obtained by 60 students in Mathematics. Construct a grouped frequency distribution starting from 10 so that the width of each class interval is also 10.
52, 44, 38, 59, 51, 71, 20, 44, 30, 47
40, 63, 39, 59, 52, 72, 23, 45, 39, 60
54, 30, 47, 40, 14, 73, 25, 45, 39, 61
52, 73, 64, 55, 30, 47, 28, 46, 39, 61
52, 74, 28, 46, 40, 65, 55, 33, 29, 62
54, 28, 46, 40, 62, 54, 47, 40, 65, 55
Solution:
In this data the smallest mark is 14 and the largest is 74. The grouped frequency distribution of the marks obtained by 60 students is as follows.
Marks | Tally bars | Number of students (f)
10-20 | / | 1
20-30 | ///// // | 7
30-40 | ///// //// | 9
40-50 | ///// ///// ///// / | 16
50-60 | ///// ///// /// | 13
60-70 | ///// //// | 9
70-80 | ///// | 5
Total | | N = Σf = 60
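The tallying in Example 1 can be reproduced with a small Python sketch. The function name `grouped_frequency` is hypothetical, and the sketch assumes exclusive classes of equal width, so a value equal to an upper limit is counted in the next class.

```python
def grouped_frequency(values, start, width, n_classes):
    """Count values in exclusive classes [start, start + width), ..."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)   # which class the value falls in
        if 0 <= index < n_classes:
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Marks of the 60 students in Example 1
marks = [
    52, 44, 38, 59, 51, 71, 20, 44, 30, 47,
    40, 63, 39, 59, 52, 72, 23, 45, 39, 60,
    54, 30, 47, 40, 14, 73, 25, 45, 39, 61,
    52, 73, 64, 55, 30, 47, 28, 46, 39, 61,
    52, 74, 28, 46, 40, 65, 55, 33, 29, 62,
    54, 28, 46, 40, 62, 54, 47, 40, 65, 55,
]

table = grouped_frequency(marks, start=10, width=10, n_classes=7)
for lower, upper, f in table:
    print(f"{lower}-{upper} | {f}")
```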
Example 2
If class size (h) is not given:
The weights
in grams of 50 apples, picked from a box are as follows:
110 103
89 75 98
121 110 108
93 128 185
123 113 92
86 70 126
78 13 120
29 119 105
120 100 116
85 99 114
189 205 87
111 141 136
123 90 115
128 160 78
90 107 81
137 75 84
104 109 115
Construct a frequency distribution table with a suitable number of classes.
Solution: Since the class size (h) is not given, we first find the approximate number of class intervals (K) and the class size (h).
Number of observations, n = 50, S = smallest value = 70, L = largest value = 205
By Sturges' formula,
Number of classes, K = 1 + 3.322 log n = 1 + 3.322 log 50 = 1 + 3.322 × 1.6989 = 6.64 ≈ 7
Range, R = L – S = 205 – 70 = 135
Class width or class size, h = R / K = 135 / 7 = 19.29 ≈ 20
Weights (in gms.) | Tally bars | No. of apples (f)
70-90 | ///// ///// / | 11
90-110 | ///// ///// /// | 13
110-130 | ///// ///// ///// /// | 18
130-150 | //// | 4
150-170 | / | 1
170-190 | // | 2
190-210 | / | 1
Total | | N = ∑f = 50
Example 3
If class size (h) is not given:
The birth weights (kg) of 30 children were recorded as follows:
2.0, 2.1, 2.3, 3.0, 3.1, 2.7, 2.8, 3.5, 3.1, 3.7
4.0, 2.3, 3.5, 4.2, 3.7, 3.2, 2.7, 2.5, 2.7, 3.8
3.1, 3.0, 2.6, 2.8, 2.9, 3.5, 4.1, 3.9, 2.8, 2.2
Prepare a grouped frequency distribution for above data.
Solution: Since the class size (h) is not given,
Total no. of observations (n) = 30
Smallest value or observation (S) = 2.0
Largest value or observation (L) = 4.2
By Sturges' formula,
Number of classes, K = 1 + 3.322 log n = 1 + 3.322 log 30 = 1 + 3.322 × 1.477121 = 5.906996 ≈ 6
∴ Class size or class width (h) = (L – S) / K = (4.2 – 2.0) / 6 = 0.37 ≈ 0.4
Now, the frequency distribution is shown below:
Birth weight (kg) | Tally bars | Frequency (f)
2.0 – 2.4 | ///// | 5
2.4 – 2.8 | ///// | 5
2.8 – 3.2 | ///// //// | 9
3.2 – 3.6 | //// | 4
3.6 – 4.0 | //// | 4
4.0 – 4.4 | /// | 3
Total | | N = 30
Note: Sturges' rule should not be regarded as final but should be considered as a guide only.
The formula is K = 1 + 3.322 log n
where K is the number of classes and n is the number of values or observations in the data set under consideration.
The size of the class intervals may be decided by using the following formula.
h = R / K
where R = Range = Largest item – Smallest item under consideration, and K = the number of class intervals.
Types of Class Intervals
Exclusive type of class interval: Class intervals in which the upper limit of the preceding class is the lower limit of the succeeding class, so that the classes are continuous without any gap, are known as exclusive class intervals. In exclusive class intervals, the upper limit of each class interval is excluded from that class but included in the next class.
For example:
Marks | 60-70 | 70-80 | 80-90 | 90-100
No. of students | 10 | 15 | 5 | 2
Inclusive type of class interval: If the upper limit as well as the lower limit of each class interval is included in that class interval, then the class intervals are known as inclusive class intervals. In this type of class interval, the upper limit of the preceding class is not the lower limit of the succeeding class.
The width of this type of class interval = Lower limit of next C.I. – Lower limit of previous C.I. For example:
Marks | 60-69 | 70-79 | 80-89 | 90-99
No. of students | 21 | 10 | 6 | 3
Conversion of Inclusive Class Interval to Exclusive Class Interval
Generally, in statistical computation, we prefer exclusive class intervals. To convert inclusive class intervals to exclusive class intervals, let us define the correction factor as below.
Correction factor, C.F. = (Lower limit of a class – Upper limit of the preceding class) / 2
Now, real lower limit = Lower limit – C.F.
Real upper limit = Upper limit + C.F.
Example 4
Convert the inclusive frequency distribution into an exclusive frequency distribution from the data given below.
Marks | 50-59 | 60-69 | 70-79 | 80-89 | 90-99
No. of students | 10 | 15 | 30 | 20 | 12
Solution:
Here,
Correction factor, C.F. = (Lower limit of a class – Upper limit of the preceding class) / 2 = (60 – 59) / 2 = 1 / 2 = 0.5
Real lower limit = Lower limit – C.F. = Lower limit – 0.5
Real upper limit = Upper limit + C.F. = Upper limit + 0.5
Then, the exclusive frequency distribution is given below.
Marks | 49.5-59.5 | 59.5-69.5 | 69.5-79.5 | 79.5-89.5 | 89.5-99.5
No. of students | 10 | 15 | 30 | 20 | 12
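The conversion in Example 4 can be sketched in Python as follows. The function name `to_exclusive` is illustrative, and the sketch assumes the same gap between every pair of consecutive classes, so a single correction factor applies throughout.

```python
def to_exclusive(classes):
    """Convert inclusive (lower, upper) limits to real (exclusive) limits."""
    # Correction factor: half the gap between the second class's lower limit
    # and the first class's upper limit
    cf = (classes[1][0] - classes[0][1]) / 2
    return [(lower - cf, upper + cf) for lower, upper in classes]

inclusive = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99)]
for lower, upper in to_exclusive(inclusive):
    print(f"{lower}-{upper}")
```

The frequencies are unchanged by the conversion; only the class limits move by the correction factor.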
Example 5
Convert the following data into exclusive classes:
Interval | 10-14 | 15-19 | 20-24 | 25-29 | 30-34
Frequency | 11 | 15 | 8 | 9 | 8
Solution: Correction factor, C.F. = (15 – 14) / 2 = 0.5
Then, the exclusive frequency distribution is given below.
Interval | 9.5-14.5 | 14.5-19.5 | 19.5-24.5 | 24.5-29.5 | 29.5-34.5
Frequency | 11 | 15 | 8 | 9 | 8
Open-ended classes: If the lower limit of the first class or the upper limit of the last class or both are not specified, then such a class is called an open-ended class.
For example:
(a)
Marks | Below 50 | 50-60 | 60-70 | 70-80
No. of students | 10 | 20 | 12 | 4
(b)
Marks | Below 60 | 60-70 | 70-80 | Above 80
No. of students | 10 | 20 | 12 | 4
(c)
Marks | 50-60 | 60-70 | 70-80 | Above 80
No. of students | 10 | 20 | 12 | 4
Cumulative Frequency Distribution
After arranging the classes (variate values) in ascending order of magnitude, the frequency distribution in which the frequencies are cumulated either from top to bottom or from bottom to top is called a cumulative frequency distribution. There are two types of cumulative frequency distribution.
(a) Less than cumulative frequency distribution
(b) More than cumulative frequency distribution
Less than cumulative frequency distribution
After arranging the classes (variate values) in ascending order of magnitude, the frequency distribution in which the frequencies are cumulated from top to bottom is called a less than cumulative frequency distribution.
Example
Marks | No. of students (Less than c.f.)
Less than 10 | 5
Less than 20 | 12
Less than 30 | 20
Less than 40 | 22
Less than 50 | 30
OR
Marks (Less than) | 10 | 20 | 30 | 40 | 50
No. of students | 5 | 12 | 20 | 22 | 30
Conversion of the above less than cumulative frequency distribution into a simple frequency distribution
Marks | No. of students (f)
0-10 | 5
10-20 | 12 – 5 = 7
20-30 | 20 – 12 = 8
30-40 | 22 – 20 = 2
40-50 | 30 – 22 = 8
More than cumulative frequency distribution
After arranging the classes (variate values) in ascending order of magnitude, the frequency distribution in which the frequencies are cumulated from bottom to top is called a more than cumulative frequency distribution. Example:
Income (in ‘000’ Rs.) | No. of persons (More than c.f.)
Above 10 | 35
Above 20 | 27
Above 30 | 20
Above 40 | 8
Above 50 | 3
OR
Income (in ‘000’ Rs.) (More than) | 10 | 20 | 30 | 40 | 50
No. of persons | 35 | 27 | 20 | 8 | 3
Conversion of the above more than cumulative frequency distribution into a simple frequency distribution
Income (in ‘000’ Rs.) | No. of persons (f)
10-20 | 35 – 27 = 8
20-30 | 27 – 20 = 7
30-40 | 20 – 8 = 12
40-50 | 8 – 3 = 5
50 and above | 3
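The cumulation and its reversal can be sketched in Python. The four function names below are illustrative only; each list is assumed to follow the classes in ascending order.

```python
from itertools import accumulate

def less_than_cf(freqs):
    """Cumulate frequencies from top to bottom."""
    return list(accumulate(freqs))

def more_than_cf(freqs):
    """Cumulate frequencies from bottom to top."""
    return list(accumulate(reversed(freqs)))[::-1]

def from_less_than(cf):
    """Recover simple frequencies from a less than c.f. column."""
    return [cf[0]] + [b - a for a, b in zip(cf, cf[1:])]

def from_more_than(cf):
    """Recover simple frequencies from a more than c.f. column."""
    return [a - b for a, b in zip(cf, cf[1:])] + [cf[-1]]

print(from_less_than([5, 12, 20, 22, 30]))   # marks example above
print(from_more_than([35, 27, 20, 8, 3]))    # income example above
```

The last entry of a more than c.f. column is already the frequency of the last (open-ended) class, so it is carried over unchanged.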
Relative Frequency
Relative frequency is simply the proportion (ratio) of the observations belonging to a class to the total number of observations. That is,
Relative frequency of class = Frequency of class / Total number of observations = f / N
If the relative class frequency is multiplied by 100, it is called the percent frequency of the class.
Relative percentage frequency of class = (f / N) × 100
Example 6
Prepare the simple frequency distribution, relative frequency distribution, percentage frequency distribution and more than cumulative frequency distribution from the following data:
Weight (kg) | Below 10 | Below 20 | Below 30 | Below 40 | Below 50
No. of children | 5 | 9 | 12 | 15 | 19
OR
Weight (in kg) (Less than) | 10 | 20 | 30 | 40 | 50
No. of children | 5 | 9 | 12 | 15 | 19
Solution:
Conversion of the above less than cumulative frequency distribution into a simple frequency distribution
Simple frequency distribution
Weight (in kg) | No. of children (f)
0-10 | 5
10-20 | 9 – 5 = 4
20-30 | 12 – 9 = 3
30-40 | 15 – 12 = 3
40-50 | 19 – 15 = 4
Total | N = ∑f = 19
Since relative frequency of class = f / N and percentage frequency of class = (f / N) × 100,
Relative frequency distribution and percentage frequency distribution
Weight (in kg) | No. of children (f) | Relative frequency | Percentage frequency
0-10 | 5 | 5/19 = 0.26 | 26
10-20 | 4 | 4/19 = 0.21 | 21
20-30 | 3 | 3/19 = 0.16 | 16
30-40 | 3 | 3/19 = 0.16 | 16
40-50 | 4 | 4/19 = 0.21 | 21
Total | N = ∑f = 19 | 1 | 100
More than cumulative frequency distribution
Weight (in kg) | More than c.f.
More than 0 | 14 + 5 = 19
More than 10 | 10 + 4 = 14
More than 20 | 7 + 3 = 10
More than 30 | 4 + 3 = 7
More than 40 | 4
Example 7
Calculate the relative frequency and percent frequency for each class from the following frequency distribution.
Blood pressure (mmHg) | 125-135 | 135-145 | 145-155 | 155-165 | 165-175 | Total
No. of patients (f) | 10 | 15 | 37 | 12 | 6 | N = 80
Solution
We have,
Relative frequency = Frequency of class / Total observations
Percent frequency = Relative frequency × 100
Calculation of relative frequency and percent frequency:
Blood pressure (in mmHg) | No. of patients (f) | Relative frequency | Relative percentage frequency
125-135 | 10 | 10/80 = 0.125 | 12.5
135-145 | 15 | 15/80 = 0.1875 | 18.75
145-155 | 37 | 37/80 = 0.4625 | 46.25
155-165 | 12 | 12/80 = 0.15 | 15
165-175 | 6 | 6/80 = 0.075 | 7.5
Total | N = 80 | 1 | 100
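The calculation in Example 7 can be verified with a short Python sketch. The function name `relative_frequencies` is illustrative; the values are left unrounded so that they add up to exactly 1.

```python
def relative_frequencies(freqs):
    """Relative frequency f / N for each class."""
    n = sum(freqs)
    return [f / n for f in freqs]

patients = [10, 15, 37, 12, 6]                  # class frequencies, N = 80
relative = relative_frequencies(patients)
percent = [100 * r for r in relative]           # percent frequency of each class

for f, r, p in zip(patients, relative, percent):
    print(f"{f:>3} | {r:.4f} | {p:.2f}")
```

Rounding each relative frequency to two decimals before adding can make the total differ slightly from 1, which is why the unrounded values are kept here.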
Example 8
Following are the marks in statistics (out of 50 full marks) obtained by 25 students of Pokhara University.
40, 25, 30, 22, 35, 18, 28, 38, 45, 34, 31, 15, 24, 19, 20, 26, 09, 18, 07, 29, 35, 43, 41, 40, 30
Construct a grouped frequency distribution with a suitable class size (or class width) and prepare the relative frequency distribution.
Solution:
Since the class size or class width (h) is not given, the number of classes (K) and the class size (h) are calculated first.
Number of observations, n = 25, S = smallest value = 7, L = largest value = 45
By Sturges' formula,
Number of classes, K = 1 + 3.322 log n = 1 + 3.322 log 25 = 1 + 3.322 × 1.39794 = 5.643 ≈ 6
Class size or class width, h = (L – S) / K = (45 – 7) / 6 = 6.33 ≈ 7
Marks | Tally bars | No. of students (f) | Relative frequency = f / N
7-14 | // | 2 | 0.08
14-21 | ///// | 5 | 0.20
21-28 | //// | 4 | 0.16
28-35 | ///// / | 6 | 0.24
35-42 | ///// / | 6 | 0.24
42-49 | // | 2 | 0.08
Total | | N = ∑f = 25 | 1
Example 9
Find the class intervals from the following frequency distribution.
Mid-point | 5 | 10 | 15 | 20 | 25 | 30
Frequency | 1 | 3 | 5 | 10 | 2 | 1
Solution
Class size or width of class (h) = Difference between two successive mid-values = 10 – 5
h = 5
Subtract h/2 = 2.5 from each mid-value to get the lower limit and add h/2 = 2.5 to the same mid-value to get the upper limit.
Hence, the grouped frequency distribution of the given problem is as follows.
Class interval | 2.5-7.5 | 7.5-12.5 | 12.5-17.5 | 17.5-22.5 | 22.5-27.5 | 27.5-32.5
Frequency | 1 | 3 | 5 | 10 | 2 | 1
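The same steps can be sketched in Python. The function name `class_limits` is illustrative, and the sketch assumes equally spaced mid-points, so the class width is the difference between the first two.

```python
def class_limits(midpoints):
    """Recover (lower, upper) class limits from equally spaced mid-points."""
    h = midpoints[1] - midpoints[0]       # class width
    return [(m - h / 2, m + h / 2) for m in midpoints]

for lower, upper in class_limits([5, 10, 15, 20, 25, 30]):
    print(f"{lower}-{upper}")
```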
Diagrammatic presentation of data
After
classification and tabulation of numerical data for statistical enquiry, the
next method of presenting the numerical data is the use of diagrams.
Presentation of statistical data through the use of diagram is known as
diagrammatic presentation.
Diagrams are
nothing but the presentation of statistical data in the form of geometrical
figures like bars, rectangles, circles etc.
General rules for constructing diagrams
There are no hard and fast rules for constructing diagrams. However, the following points have to be kept in mind while constructing diagrams.
Title: The title should be brief, self-explanatory and clear. The title should be written either at the top or at the bottom of the diagram. Generally, it is kept at the top of the diagram.
Choice of a diagram: A proper diagram should be chosen depending upon the nature of the data.
Simplicity: As far as possible, the diagram should be simple and easy to understand.
Proportion between dimensions: A proper proportion between the dimensions (i.e. height and width) of the diagram should be maintained.
Neatness and cleanliness: As diagrams are visual aids for presenting numerical data, they should be neat and clean.
Selection of scale: The most important thing is the selection of a proper scale. The scale (i.e. horizontal and vertical scales) should be clearly shown on the diagram.
Footnote: If necessary, footnotes should be given at the bottom of the diagram to explain certain facts or points which are not included in the title.
Index: A brief index showing the various types of shades, colours and lines should be given for clear understanding of the diagram.
There are different methods of presenting data in diagrams, which are as follows:
• Bar diagrams (one-dimensional diagrams)
• Pie-diagrams or pie-charts (two-dimensional diagrams)
Bar Diagrams (One dimensional diagram)
Bar diagrams are one of the simplest, easiest and most commonly used diagrammatic methods of presenting scientific, health, medical, engineering, business, economic and research data. A bar diagram consists of a set of rectangles, one for each set of data, in which the magnitudes or values are represented by the length (or height). Bar diagrams are one-dimensional diagrams since only one dimension (i.e. height or length) of the bars is taken into consideration. They are more suitable for nominal or ordinal categories.
Before drawing bar diagrams, the following points are to be kept in mind:
• All the bars must be drawn on the same base line.
• The breadths of the bars must be the same.
• The height (or length) of the bars must be proportional to the magnitude (or values).
• Bars can be drawn vertically or horizontally, whichever is convenient.
• The gap (distance) between two successive bars should be the same.
Types of bar diagrams
• Simple bar diagram
• Sub-divided or component bar diagram
• Percentage sub-divided bar diagram
• Multiple bar diagram
Simple bar diagram
A simple bar diagram is a diagram for only one variable. It is the simplest of the bar diagrams and is frequently used for the comparative study of two or more values of a single variable. It consists of a set of equidistant bars (or rectangles) of equal width. The heights or lengths of the bars represent the given values of the variable.
Example 1
The following information provides the number of tourists by major nationality in 1995 in Nepal. Present the data in the form of a suitable diagram.
Solution:
Nationality | No. of tourists (millions)
India | 2.9
UK | 1.2
USA | 2.5
Australia | 1.8
Others | 1.5
[Figure: simple bar diagram of the number of tourists by nationality]
Example 2
Prepare a simple bar diagram from the following data:
Year | 2002 | 2003 | 2004 | 2005 | 2006
No. of patients’ records | 400 | 150 | 300 | 200 | 500
Sub-divided or component bar diagram
A simple bar diagram can represent only one characteristic or classification at a time. When two or more characteristics have to be presented at the same time, a sub-divided or component bar diagram is suitable. A sub-divided or component bar diagram shows two or more components of a total. It is therefore particularly useful when the total magnitude of the variable under consideration is to be divided into different parts or segments.
Example 1
Represent the following information regarding work participation rate by ecological region of Nepal through a sub-divided bar diagram.
Region | 1971 | 1981 | 1991
Mountains | 70.0 | 75.7 | 74.5
Hills | 62.9 | 68.8 | 62.8
Terai | 51.3 | 58.8 | 48.6
Solution
[Figure: sub-divided bar diagram with one bar per year; index: Mountains, Hills, Terai]
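Example 2
Represent the following data by means of a sub-divided bar diagram.
Year | Patients (Male) | Patients (Female) | Patients (Children)
2002 | 100 | 200 | 85
2003 | 200 | 180 | 150
Solution:
Year | Patients (Male) | Patients (Female) | Patients (Children) | Total
2002 | 100 | 200 | 85 | 385
2003 | 200 | 180 | 150 | 530
[Figure: sub-divided bar diagram of patients by year; index: Male, Female, Children]
Example 3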
Represent the following information by sub-divided bar
diagram.
Items of expenditure | Family A (Income Rs. 5000) | Family B (Income Rs. 3000)
Food | 1500 | 1500
Clothing | 1250 | 600
Education | 250 | 500
Miscellaneous | 1900 | 700
Saving or deficit | +100 | –300
[Figure: sub-divided bar diagram of the expenditure of Family A and Family B, with index]
Percentage Sub-divided Bar Diagram
A percentage sub-divided bar diagram is a sub-divided bar diagram in which the data are presented in terms of percentages. It is used for comparing relative changes in the data, for which it is more appropriate than a simple sub-divided bar diagram. The height of each bar is the same, i.e. 100, since the sum of the percentage values is equal to 100.
Example 1
The percentage breakup of the cost of constructing a house in Kathmandu (land cost excluded) is given below:
Labour = 25%, Bricks = 15%, Cement = 20%, Steel = 15%, Timber = 10%, Supervision = 15%
Construct the percentage sub-divided bar diagram.
Solution:
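Since the bar has a fixed height of 100, each component occupies the band between successive running totals of the percentages. A minimal Python sketch of that calculation is given below; the function name `segment_bounds` is illustrative, and stacking the components in the order listed is an assumption.

```python
from itertools import accumulate

def segment_bounds(percentages):
    """Return (item, bottom, top) for each component of a 100% bar."""
    tops = list(accumulate(percentages.values()))   # running totals
    bottoms = [0] + tops[:-1]
    return [(item, b, t) for item, b, t in zip(percentages, bottoms, tops)]

cost = {"Labour": 25, "Bricks": 15, "Cement": 20,
        "Steel": 15, "Timber": 10, "Supervision": 15}

for item, bottom, top in segment_bounds(cost):
    print(f"{item:<12} {bottom:>3} to {top:>3}")
```

The top of the last segment equals 100, which confirms that the percentages account for the whole cost.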
Multiple Bar Diagram
A multiple bar diagram is also a one-dimensional bar diagram. It is used to compare two or more interrelated sets (series) of data, so that several characteristics of the same data can be presented and compared side by side. A constant distance should be maintained between the groups of bars. In order to differentiate the bars, different colours or shades are used.
For example, the numbers of students of a certain school who passed in the first, second and third division in a certain examination in different years can be represented by a multiple bar diagram as shown below.
[Figure: multiple bar diagram showing passed students, based on hypothetical data]
Example 1
Represent the following data on health personnel per 100,000 population and its growth between 1991/92 and 2010/11 by means of a multiple bar diagram using Excel.
Personnel | Number of personnel in 1991/92 | Number of personnel in 2001/02 | Number of personnel in 2010/11
Doctor | 5 | 6 | 10
Nurse | 10 | 15 | 25
Health Assistant | 7 | 10 | 20
Health Worker | 2 | 30 | 90
Example 2
Represent the following data with the help of a multiple bar diagram.
Country | No. of Tourists in Year 2010 | No. of Tourists in Year 2015
USA | 220 | 300
UK | 140 | 200
France | 80 | 150
India | 500 | 350
China | 300 | 250
Others | 150 | 280
Solution:
[Figure: multiple bar diagram of tourists by country, with paired bars for 2010 and 2015]
Example 3
Draw a multiple bar diagram from the following information:
Disease | Household percentage (Terai) | Household percentage (Hills) | Household percentage (Mountains)
Jaundice | 5.5 | 4.0 | 6.0
Typhoid | 7.0 | 5.5 | 8.0
Diarrhoea | 12.9 | 14.1 | 20.5
Distinguish between diagrams and graphs
Diagrams | Graphs
1. They are constructed on plain paper. | 1. They are constructed on graph paper.
2. They may be one-, two- or three-dimensional. | 2. They are generally two-dimensional.
3. The numerical data are presented by bars, rectangles, circles, cubes, etc. | 3. The numerical data are presented in terms of points and lines.
4. They are used only for comparison. | 4. They help in studying further mathematical and statistical relationships.
5. Construction of diagrams is easier than that of graphs. | 5. Construction of graphs is more difficult than that of diagrams.
6. Diagrams are more attractive and familiar to everyone. | 6. Graphs are familiar mainly to statisticians and researchers.
7. Diagrams are not used for presenting frequency distributions. | 7. Graphs are more appropriate for presenting frequency distributions.
8. They are rarely used by statisticians and research workers. | 8. They are frequently used by statisticians and research workers.
Pareto Diagram
It is a diagram containing bars and a line graph, in which the categorized responses are plotted in descending order of their frequencies by bars and the cumulative frequencies by a line graph. The categories, in descending order of frequency, are taken along the horizontal axis, the frequencies along the left vertical axis and the percentage cumulative frequencies along the right vertical axis. Equally spaced bars of equal width are drawn on the horizontal axis with heights proportional to the corresponding frequencies; the percentage cumulative frequencies are then plotted against the mid-points of the bars and joined by lines.
• It is used in statistical control of process and product quality.
• It is also a useful tool for presenting categorical data, particularly when the number of classifications or groupings increases.
Example: For a computer-controlled lathe whose performance was below par, workers recorded the following causes and their frequencies:
Power fluctuations | Controller not stable | Operator error | Worn tool not replaced | Other
6 | 22 | 13 | 5 | 2
Solution
Arrange the data in descending order of frequency and obtain the cumulative frequencies and percentage cumulative frequencies as follows:
Categories | Frequency | Cumulative frequency | Percentage cumulative frequency
Controller not stable | 22 | 22 | 46
Operator error | 13 | 35 | 73
Power fluctuations | 6 | 41 | 85
Worn tool not replaced | 5 | 46 | 96
Other | 2 | 48 | 100
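The table behind the Pareto diagram can be built with a short Python sketch. The function name `pareto_table` is illustrative, and rounding the percentages to whole numbers follows the table above.

```python
def pareto_table(counts):
    """Rows of (category, frequency, cumulative frequency, cumulative %)."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    rows, running = [], 0
    for category, freq in ordered:
        running += freq
        rows.append((category, freq, running, round(100 * running / total)))
    return rows

causes = {
    "Power fluctuations": 6,
    "Controller not stable": 22,
    "Operator error": 13,
    "Worn tool not replaced": 5,
    "Other": 2,
}

for row in pareto_table(causes):
    print(" | ".join(str(cell) for cell in row))
```

The bars of the diagram use the frequency column and the line graph uses the last column.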
Two-dimensional diagram
Bar diagrams are one-dimensional diagrams since the magnitude of the given data is represented through only one dimension, either the height or the length of the bars. In two-dimensional diagrams, the length as well as the width is considered. Thus, the magnitudes of the given data are represented by the area of the diagram.
The different types of two-dimensional diagrams are rectangles, squares and circles. Here we consider only the circle (i.e. pie-diagram or pie-chart).
Pie-diagram (Pie-chart)
The pie diagram is another popular and widely used method of diagrammatic presentation, in the form of a circle whose area represents the total value. It is a two-dimensional diagram. It is also known as a pie-chart, circular diagram or angular diagram. In this method all the given values are converted into angles so that the sum of all these angles is equal to 360 degrees.
The main drawback of the pie-diagram is that it cannot be used when negative values such as a loss or deficit are to be presented, and it is not suitable when there are several small sectors.
The given values are converted into angles in the following way:
Total value = 100% = 360° (the angle at the centre of the circle = 360°)
Then, central angle for any given value = (Given value / Total value) × 360°
Example 1
Express the following data using a pie-chart.
Items | Expenditure as percent of total
Food | 50
Clothing | 15
Housing | 10
Fuel | 5
Education | 10
Entertainment | 5
Miscellaneous | 5
Solution
Since the angle at the centre of the circle = 360°,
Total value = 100% = 360°
Then x% = (360 / 100 × x)° = (3.6 × x)°
Calculation for pie-diagram
Items | % of exp. (x) | Angle = (3.6 × x)°
Food | 50 | 3.6 × 50 = 180°
Clothing | 15 | 3.6 × 15 = 54°
Housing | 10 | 3.6 × 10 = 36°
Fuel | 5 | 3.6 × 5 = 18°
Education | 10 | 3.6 × 10 = 36°
Entertainment | 5 | 3.6 × 5 = 18°
Miscellaneous | 5 | 3.6 × 5 = 18°
Total | 100 | 360°
Example 2
Construct pie diagrams for the following weekly family expenditures of two families F1 and F2.
Items | Family F1 (Income Rs. 400) | Family F2 (Income Rs. 900)
Food | 200 | 350
Education | 50 | 100
Fuel | 50 | 150
Entertainment | 20 | 90
Miscellaneous | 80 | 210
Solution
For family F1:
Taking Rs. 400 = 360°
Re. 1 = (360 / 400)°
For the food item, Rs. 200 = (360 / 400 × 200)° = 180°
In similar fashion, the other angles are calculated.
For family F2:
Taking Rs. 900 = 360°
Re. 1 = (360 / 900)°
For the food item, Rs. 350 = (360 / 900 × 350)° = 140°
Similarly, the others can be calculated.
Calculation for pie-diagram
Items | F1 expenditure | F1 angle | F2 expenditure | F2 angle
Food | 200 | 180° | 350 | 140°
Education | 50 | 45° | 100 | 40°
Fuel | 50 | 45° | 150 | 60°
Entertainment | 20 | 18° | 90 | 36°
Miscellaneous | 80 | 72° | 210 | 84°
Total | 400 | 360° | 900 | 360°
Square root of total | 20 | | 30 |
Since the area of each circle represents the total expenditure, the radii are taken in proportion to the square roots of the totals: r1 : r2 = 20 : 30 = 2 : 3.
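The angle calculation for the two families can be checked with a short Python sketch. The function name `pie_angles` is illustrative; the radius ratio is taken as the ratio of the square roots of the totals, on the assumption that the area of each circle represents the total expenditure.

```python
import math

def pie_angles(values):
    """Central angle in degrees for each item: value / total x 360."""
    total = sum(values.values())
    return {item: value * 360 / total for item, value in values.items()}

f1 = {"Food": 200, "Education": 50, "Fuel": 50,
      "Entertainment": 20, "Miscellaneous": 80}
f2 = {"Food": 350, "Education": 100, "Fuel": 150,
      "Entertainment": 90, "Miscellaneous": 210}

angles_f1 = pie_angles(f1)
angles_f2 = pie_angles(f2)

# Radii are proportional to the square roots of the totals (20 and 30)
radius_ratio = math.sqrt(sum(f2.values())) / math.sqrt(sum(f1.values()))

print(angles_f1)
print(angles_f2)
print(radius_ratio)
```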
Graphical Presentation of Data
Like diagrammatic presentation, graphical presentation is another method of presenting data so that the information it carries can be understood easily. Graphical presentation of data is the technique of presenting data visually in the form of a graph. Even when the data show complex relations among variables, graphs make those relations clear at a glance. Graphs make the data easier to interpret and analyze. They are particularly helpful in depicting the median, mode, skewness, kurtosis, correlation, regression and time series.
A graph should be neat, simple, clear and self-explanatory, and it should have a self-explanatory title. If necessary, footnotes, source notes and indices should be placed at appropriate places in the graph. The arrangement of scales is also an important part of constructing a graph.
The importance of graphical presentation of data is as follows:
• A graph presents the data in a form that is easier for the user to understand.
• It gives a more attractive presentation of statistical data than a table. The main features of the data, such as trends and fluctuations, can be noticed at a glance.
• It is a time-saving device.
• It facilitates comparison of data relating to two different periods of time or to two different regions.
• It does not require much knowledge of mathematics on the part of the user to understand the message of the graph.
Graphs can generally be divided into two categories:
(a) Graph of time series
(b) Graph of frequency distribution, or frequency graph
1. Graph of time series
This is a graphic presentation of chronological data (data that vary with time). Generally the time variable is kept on the X-axis and the other variable under consideration is kept on the Y-axis.
2. Graph of Frequency Distribution or Frequency Graphs
This is another category of graphs, which shows the relationship between the values of the variable and their corresponding frequencies. The main types of frequency graphs are:
(a) Histogram
(b) Frequency polygon
(c) Frequency curve
(d) Cumulative frequency curve (or ogive)
Each of these is discussed below with suitable examples and illustrations.
Histogram
The histogram is the most common and widely used graphic presentation of a frequency distribution. It consists of a set of adjacent vertical rectangles on the X-axis, with bases equal to the widths of the corresponding class intervals and heights proportional to the frequencies of the classes, so that the area of each rectangle is proportional to the frequency of the corresponding class.
A histogram is used to describe numerical data that have been grouped into frequency, relative frequency or percentage distributions. The variable under consideration is kept along the X-axis, and the frequency, relative frequency or percentage frequency of each class interval is kept along the vertical Y-axis. A histogram constructed with the variable along the X-axis and the corresponding relative frequencies along the Y-axis is called a relative frequency histogram.
A histogram is useful for locating the mode, and it is also used to check the normality of a distribution.
Histogram with equal width classes
If the class intervals in the frequency distribution are of equal width, the vertical bars of the histogram are also of equal width. Such a histogram is called a histogram with equal width classes. The height of each rectangle is the frequency of the corresponding class interval.
Example 1
Construct a histogram from the following grouped frequency distribution.
Class interval | 5-10 | 10-15 | 15-20 | 20-25 | 25-30 | 30-35
Frequency | 2 | 5 | 7 | 12 | 6 | 3
Solution
[Figure: histogram with class interval on the X-axis and frequency on the Y-axis]
Example 2
Plot a histogram for the following frequency distribution and locate the mode.
Marks | 0-20 | 20-40 | 40-60 | 60-80 | 80-100
No. of students | 10 | 22 | 35 | 28 | 5
Solution: This is a case of equal class size (or equal class width).
[Figure: histogram with marks on the X-axis and number of students on the Y-axis]
Mode (Mo) = 52 marks
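The graphical construction of the mode (joining the top corners of the tallest rectangle to the adjacent corners of its two neighbours and dropping a perpendicular from the intersection) corresponds to the usual interpolation formula Mo = L + (f1 - f0) / (2 f1 - f0 - f2) × h, where L is the lower limit of the modal class, h its width, f1 its frequency, and f0 and f2 the frequencies of the classes before and after it. The Python sketch below applies that formula to the data of this example; the function name and argument names are illustrative assumptions. A value read off a drawn graph is approximate, so a small difference from the figure quoted above is to be expected.

```python
def histogram_mode(lower, width, f_prev, f_modal, f_next):
    """Mode of grouped data by linear interpolation within the modal class."""
    d1 = f_modal - f_prev  # excess over the preceding class
    d2 = f_modal - f_next  # excess over the following class
    return lower + d1 * width / (d1 + d2)


# Modal class is 40-60 (frequency 35); neighbours have frequencies 22 and 28.
mode = histogram_mode(lower=40, width=20, f_prev=22, f_modal=35, f_next=28)
print(mode)
```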
Example 3
The following is the distribution of marks obtained by 50 students. Plot a histogram for these data.
Marks | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 | 70-79
No. of students | 5 | 10 | 20 | 8 | 6 | 1
Solution:
Correction factor = (lower limit of the second class - upper limit of the first class) / 2 = (30 - 29) / 2 = 0.5
The given inclusive class intervals are converted into exclusive classes by subtracting the correction factor from each lower limit and adding it to each upper limit.
Marks | 19.5-29.5 | 29.5-39.5 | 39.5-49.5 | 49.5-59.5 | 59.5-69.5 | 69.5-79.5
No. of students | 5 | 10 | 20 | 8 | 6 | 1
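The conversion from inclusive to exclusive classes is mechanical, as the following minimal Python sketch shows. It assumes the classes are listed in ascending order with a constant gap between consecutive classes; the function name to_exclusive is an assumption made for illustration.

```python
def to_exclusive(classes):
    """Convert inclusive (lower, upper) class limits to exclusive boundaries."""
    # Half the gap between the first upper limit and the second lower limit.
    correction = (classes[1][0] - classes[0][1]) / 2
    return [(lo - correction, hi + correction) for lo, hi in classes]


inclusive = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
exclusive = to_exclusive(inclusive)
print(exclusive)
```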
Then the histogram is plotted as follows:
[Figure: histogram with marks on the X-axis]
Example 4
Draw a histogram from the following distribution:
Mid-point | 5 | 10 | 15 | 20 | 25 | 30
Frequency | 1 | 3 | 5 | 10 | 5 | 3
Solution:
Since mid-values are given, the class intervals are constructed first.
Class size (h) = difference between two successive mid-values = 10 - 5 = 5
Subtract h/2 = 2.5 from the first mid-value to get the lower limit of the first class interval, and add h/2 = 2.5 to the same mid-value to get its upper limit. The other class intervals are constructed in the same way, as shown in the table below:
Class interval | 2.5-7.5 | 7.5-12.5 | 12.5-17.5 | 17.5-22.5 | 22.5-27.5 | 27.5-32.5
Frequency | 1 | 3 | 5 | 10 | 5 | 3
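The construction of class intervals from mid-values can be sketched in Python as follows. The sketch assumes equally spaced mid-values; the function name intervals_from_midpoints is illustrative.

```python
def intervals_from_midpoints(midpoints):
    """Build (lower, upper) class intervals from equally spaced mid-values."""
    h = midpoints[1] - midpoints[0]  # class size
    return [(m - h / 2, m + h / 2) for m in midpoints]


midpoints = [5, 10, 15, 20, 25, 30]
intervals = intervals_from_midpoints(midpoints)
print(intervals)
```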
Histogram with unequal width classes
If the class intervals are not all of equal width, the wider classes are expressed in terms of a common class width and their frequencies are adjusted accordingly, so that the height of each rectangle is proportional to the ratio of the class frequency to the corresponding class size. The area of each rectangle then remains proportional to the class frequency.
Example 5
Represent the following daily wage distribution of 125 workers by means of a histogram.
Wage (in Rs.) | 10-20 | 20-30 | 30-40 | 40-50 | 50-70 | 70-110
No. of workers | 8 | 19 | 28 | 18 | 20 | 20
Solution:
This is a case of unequal class intervals, so the frequencies must be adjusted. The class size of the first four class intervals is 10. The class size of the fifth class is 20, which is twice 10, and the class size of the sixth (last) class is 40, which is 4 times 10. The corresponding frequencies are adjusted as follows:
Calculation of adjusted frequency
Wage (in Rs.) | 10-20 | 20-30 | 30-40 | 40-50 | 50-70 | 70-110
No. of workers (f) | 8 | 19 | 28 | 18 | 20 | 20
Magnitude of class | 10 | 10 | 10 | 10 | 20 | 40
Adjusted frequency (or height of histogram) | 8 | 19 | 28 | 18 | 20/2 = 10 | 20/4 = 5
[Figure: histogram with wages (Rs.) on the X-axis and adjusted frequency on the Y-axis]
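The adjustment amounts to dividing each frequency by the number of times its class is wider than the smallest class. A minimal Python sketch, with the function name adjusted_frequencies chosen for illustration, is:

```python
def adjusted_frequencies(classes, freqs):
    """Scale frequencies to the smallest class width (heights of the bars)."""
    widths = [hi - lo for lo, hi in classes]
    base = min(widths)  # common class width used as the unit
    return [f * base / w for f, w in zip(freqs, widths)]


classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 70), (70, 110)]
freqs = [8, 19, 28, 18, 20, 20]
heights = adjusted_frequencies(classes, freqs)
print(heights)
```

With these heights, the area of each rectangle (height × width) stays proportional to the original class frequency.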
Frequency Polygon
A frequency polygon is another method of graphical presentation of a frequency distribution. It is generally constructed by joining the mid-points of the tops of the adjacent rectangles of a histogram by straight lines. A frequency polygon can also be drawn without a histogram. In that case, the mid-point of each class interval is plotted on the X-axis against the corresponding frequency on the vertical Y-axis, and the points are joined by straight lines.
Example 6
Prepare a histogram and frequency polygon from the following data:
Class interval | 0-5 | 5-10 | 10-15 | 15-25 | 25-30 | 30-35
Frequency | 4 | 8 | 15 | 20 | 12 | 2
Solution:
Here the class widths are unequal: the width of 15-25 is 10, which is 2 times the width (5) of the other classes. So the frequency of 15-25 is divided by 2, i.e. 20/2 = 10, which is the height of the rectangle for the class 15-25. The heights of the rectangles for the other classes are the same as their frequencies.
[Figure: histogram and frequency polygon with class interval on the X-axis]
Example 7
For the following frequency distribution, plot a frequency polygon without a histogram.
Class | 15.5-20.5 | 20.5-25.5 | 25.5-30.5 | 30.5-35.5 | 35.5-40.5
Frequency | 2 | 7 | 14 | 5 | 3
Solution:
Class | 15.5-20.5 | 20.5-25.5 | 25.5-30.5 | 30.5-35.5 | 35.5-40.5
Frequency | 2 | 7 | 14 | 5 | 3
Mid-value | 18 | 23 | 28 | 33 | 38
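The points to be joined for a frequency polygon without a histogram are simply the pairs (mid-value, frequency). A short Python sketch, with the function name polygon_vertices assumed for illustration, is:

```python
def polygon_vertices(classes, freqs):
    """Return the (mid-value, frequency) points of a frequency polygon."""
    return [((lo + hi) / 2, f) for (lo, hi), f in zip(classes, freqs)]


classes = [(15.5, 20.5), (20.5, 25.5), (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [2, 7, 14, 5, 3]
vertices = polygon_vertices(classes, freqs)
print(vertices)
```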
[Figure: frequency polygon without histogram, with mid-values on the X-axis]
Frequency Curve
A smooth freehand curve drawn through the vertices of a frequency polygon is known as a frequency curve. The frequency polygon is smoothed in such a way that the area enclosed by the frequency curve is the same as the area enclosed by the frequency polygon.
Example 1
Construct a histogram, frequency polygon and frequency curve from the following data:
Class interval | 0-5 | 5-10 | 10-15 | 15-25 | 25-30 | 30-35
Frequency | 8 | 16 | 30 | 40 | 24 | 4
Solution:
Here the class widths are unequal: the width of 15-25 is 10, which is 2 times the width (5) of the other classes. So the frequency of 15-25 is divided by 2, i.e. 40/2 = 20, which is the height of the rectangle for the class 15-25. The heights of the rectangles for the other classes are the same as their frequencies.
[Figure: histogram, frequency polygon and frequency curve with class interval on the X-axis]
Cumulative Frequency Curve (or ogive)
A cumulative frequency curve is a graphical presentation of a cumulative frequency distribution. It is also known as an “ogive”. To draw an ogive, points are plotted with the cumulative frequency (c.f.) along the Y-axis and the corresponding class limits along the X-axis, and joined by a smooth freehand curve. A relative cumulative frequency curve can be drawn by keeping the relative cumulative frequencies on the Y-axis and the variable values on the X-axis. Similarly, a cumulative percentage frequency curve can be drawn by keeping the cumulative percentage frequencies on the vertical Y-axis.
Cumulative frequency curves are helpful for locating the partition values, viz. the median, quartiles, deciles and percentiles.
There are two types of cumulative frequency curve (or ogive):
(i) Less than cumulative frequency curve (or less than ogive)
(ii) More than cumulative frequency curve (or more than ogive)
Less than Cumulative Frequency Curve
This is the cumulative frequency curve obtained by plotting the upper limit of each class interval on the horizontal X-axis against the corresponding less than cumulative frequency on the Y-axis. A less than relative cumulative frequency curve (or less than cumulative percentage frequency curve) can be drawn in a similar fashion.
Example 1
Construct the following from the data given below:
(a) Less than cumulative frequency curve.
(b) Less than relative cumulative frequency curve.
(c) Less than cumulative percentage curve.
Period wise earnings of teachers (in Rs.) | 100-200 | 200-300 | 300-400 | 400-500 | 500-600 | 600-700 | 700-800 | Total
No. of teachers (f) | 5 | 10 | 22 | 30 | 16 | 12 | 15 | 110
Solution
(a) Calculation for less than cumulative frequency curve.
Earnings (in Rs.) | No. of teachers (less than cumulative frequency)
Less than 200 | 5
Less than 300 | 15
Less than 400 | 37
Less than 500 | 67
Less than 600 | 83
Less than 700 | 95
Less than 800 | 110
[Figure: less than ogive (less than cumulative frequency curve), with period wise earnings (Rs.) on the X-axis]
(b) Calculation for less than relative cumulative frequency curve
Earnings (in Rs.) | No. of teachers (less than c.f.) | Less than relative c.f.
Less than 200 | 5 | 0.05
Less than 300 | 15 | 0.14
Less than 400 | 37 | 0.34
Less than 500 | 67 | 0.61
Less than 600 | 83 | 0.75
Less than 700 | 95 | 0.86
Less than 800 | 110 | 1
[Figure: less than relative cumulative frequency curve]
(c) Calculation for less than cumulative percentage frequency curve.
[Figure: less than cumulative percentage frequency curve]
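All three sets of values for the teachers' earnings data come from the same running total: the relative cumulative frequency is the cumulative frequency divided by the total, and the cumulative percentage is that ratio multiplied by 100. The Python sketch below is an illustration under those definitions; the variable names and the rounding to two and one decimal places are assumptions made for display.

```python
from itertools import accumulate

# Upper class limits and frequencies of the teachers' earnings data.
upper_limits = [200, 300, 400, 500, 600, 700, 800]
freqs = [5, 10, 22, 30, 16, 12, 15]

cum = list(accumulate(freqs))  # less than cumulative frequencies
total = cum[-1]

relative = [round(c / total, 2) for c in cum]       # less than relative c.f.
percent = [round(100 * c / total, 1) for c in cum]  # less than cumulative %

for limit, c, r, p in zip(upper_limits, cum, relative, percent):
    print(f"Less than {limit}: c.f. = {c}, relative = {r}, percent = {p}")
```

Plotting each of the three lists against the upper class limits gives the three curves asked for in parts (a), (b) and (c).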
Example 2
Draw a less than ogive from the following data. Obtain the number of workers whose wages lie between Rs. 600 and Rs. 700.
Weekly wage (Rs.) | 500-550 | 550-600 | 600-650 | 650-700 | 700-750
No. of workers | 20 | 35 | 45 | 30 | 25
Solution:
Weekly wage (Rs.) | No. of workers (less than c.f.)
Less than 550 | 20
Less than 600 | 55
Less than 650 | 100
Less than 700 | 130
Less than 750 | 155
[Figure: less than ogive (less than c.f. curve)]
From the ogive, the number of workers whose wages lie between Rs. 600 and Rs. 700 is 130 - 55 = 75.
More than Cumulative Frequency Curve (or more than ogive)
This is the cumulative frequency curve obtained by plotting the lower limit of each class interval on the horizontal X-axis against the corresponding more than cumulative frequency on the vertical Y-axis. A more than relative cumulative frequency curve (or more than cumulative percentage frequency curve) can be drawn in a similar manner, keeping the more than relative cumulative frequency (or more than cumulative percentage frequency) on the vertical Y-axis and the lower limit of each class interval on the X-axis. Note that the foot of the perpendicular drawn to the X-axis from the point of intersection of the less than ogive and the more than ogive gives the median value.
Example 1
Given the following distribution, plot the more than cumulative frequency curve.
Class | 40-50 | 50-60 | 60-70 | 70-80 | 80-90 | 90-100
Frequency | 4 | 8 | 4 | 24 | 6 | 4
Solution
Calculation for more than cumulative frequency curve
Class | More than cumulative frequency
More than 40 | 50
More than 50 | 46
More than 60 | 38
More than 70 | 34
More than 80 | 10
More than 90 | 4
[Figure: more than ogive]
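More than cumulative frequencies are running totals taken from the last class backwards. The following Python sketch, offered as an illustration with assumed variable names, reproduces the values for the distribution in this example.

```python
from itertools import accumulate

# Lower class limits and frequencies of the distribution above.
lower_limits = [40, 50, 60, 70, 80, 90]
freqs = [4, 8, 4, 24, 6, 4]

# Accumulate from the last class towards the first, then restore the order.
more_than = list(accumulate(reversed(freqs)))[::-1]

for limit, c in zip(lower_limits, more_than):
    print(f"More than {limit}: {c}")
```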
Example 2
The following data represent the number of days during a year that the faculty of a college were absent from work. Plot the less than ogive and the more than ogive on the same plane and hence locate the median value.
Number of days | Number of employees
0-3 | 5
3-6 | 10
6-9 | 20
9-12 | 10
12-15 | 5
Total | 50
Solution
Calculation for less than ogive and more than ogive.
Number of days | No. of employees (less than c.f.) | No. of employees (more than c.f.)
Less than 3 | 5 | 50
Less than 6 | 15 | 45
Less than 9 | 35 | 35
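The intersection of the two ogives for the absence data can also be located numerically. The Python sketch below is an illustration, not the original solution: it assumes the plotted points are joined by straight line segments, that the less than ogive starts at zero at the first class boundary, and that the more than ogive falls to zero at the last one. The function name median_from_ogives is an assumption.

```python
from itertools import accumulate


def median_from_ogives(boundaries, freqs):
    """X-coordinate where the less than and more than ogives intersect."""
    less = [0] + list(accumulate(freqs))  # c.f. at each class boundary
    total = less[-1]
    more = [total - c for c in less]      # more than c.f. at each boundary
    for i in range(len(freqs)):
        d0 = less[i] - more[i]
        d1 = less[i + 1] - more[i + 1]
        # The curves cross where (less - more) changes sign.
        if d0 <= 0 <= d1 and d1 != d0:
            t = -d0 / (d1 - d0)
            return boundaries[i] + t * (boundaries[i + 1] - boundaries[i])
    raise ValueError("ogives do not intersect")


boundaries = [0, 3, 6, 9, 12, 15]  # class boundaries in days
freqs = [5, 10, 20, 10, 5]         # number of employees per class
median = median_from_ogives(boundaries, freqs)
print(median)
```

Under these assumptions the curves cross inside the class 6-9, at the point where both cumulative frequencies equal half the total number of employees.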