**INTRODUCTION**

Classical information-theoretic divergence measures have long attracted study. Kullback and Leibler (1951) first studied a measure of divergence. Jaynes (1957) introduced the Principle of Maximum Entropy (PME) and emphasized that we should "choose a distribution which is consistent with the information available and is as uniform as possible". Implementing this principle requires a measure of the nearness of two probability distributions, which Kullback and Leibler (1951) had already provided in the form of:

I (P, Q) = Σ_{i=1}^{n} p_{i} log (p_{i}/q_{i})   (1)
If the distribution Q is uniform, i.e., q_{i} = 1/n for all i, this becomes:

I (P, Q) = Σ_{i=1}^{n} p_{i} log (n p_{i}) = log n - H (P)   (2)
where P, Q∈T_{n}, the set of all complete finite discrete probability distributions:

T_{n} = {P = (p_{1}, p_{2}, ..., p_{n}): p_{i} ≥ 0, Σ_{i=1}^{n} p_{i} = 1}
Since Shannon's (1948) entropy:

H (P) = -Σ_{i=1}^{n} p_{i} log p_{i}   (3)
was already available in the literature, maximizing H is (for uniform Q) equivalent to minimizing I (P, Q). This is one interpretation of the PME.
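This equivalence can be illustrated numerically (a minimal sketch in Python; the function names `kl_divergence` and `shannon_entropy` are ours, not from the source):

```python
import math

def shannon_entropy(p):
    """Shannon's entropy H(P) = -sum p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence I(P, Q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
n = len(p)
uniform = [1.0 / n] * n

# Against a uniform Q, I(P, Q) = log n - H(P), so minimizing the
# divergence from the uniform distribution maximizes the entropy.
assert abs(kl_divergence(p, uniform) - (math.log(n) - shannon_entropy(p))) < 1e-12
```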

Eq. 1 can be analyzed in the following way:

I (P, Q) = Σ_{i=1}^{n} p_{i} log p_{i} - Σ_{i=1}^{n} p_{i} log q_{i}   (4)
The second term in Eq. 4 is the Kerridge inaccuracy, which is:

H (P, Q) = -Σ_{i=1}^{n} p_{i} log q_{i}   (5)
Considering Kerridge's (1961) inaccuracy, we can interpret the Kullback and Leibler (1951) measure of divergence as:

I (P, Q) = H (P, Q) - H (P)   (6)

i.e., the divergence is the difference of the Kerridge inaccuracy and Shannon's entropy.
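This decomposition can be checked numerically (a sketch; the function names are ours):

```python
import math

def shannon_entropy(p):
    # H(P) = -sum p_i log p_i
    return -sum(pi * math.log(pi) for pi in p)

def kerridge_inaccuracy(p, q):
    # H(P, Q) = -sum p_i log q_i  (Kerridge, 1961)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    # I(P, Q) = sum p_i log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.4, 0.35, 0.25]
q = [0.3, 0.3, 0.4]

# I(P, Q) = H(P, Q) - H(P): divergence is inaccuracy minus entropy.
assert abs(kl_divergence(p, q) - (kerridge_inaccuracy(p, q) - shannon_entropy(p))) < 1e-12
```

Since I (P, Q) ≥ 0, the inaccuracy always dominates the entropy.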

I (P, Q) provides a measure of the nearness of P to Q. Consider, for instance, Reliability Theory: there we ask how reliable the available information is. Because the distribution Q represents the revised distribution or strategy adopted to achieve an objective under given constraints, an optimization problem naturally arises.

Hence, whenever we come across divergence measures, we are interested in minimizing the divergence so as to make the available information reliable. Every walk of life is governed by the reliability of information under certain constraints.

Analogous to the information-theoretic approach, when we turn to fuzzy sets or fuzziness, we need to study fuzzy divergence measures. Fuzzy information now finds vast applications in the life and social sciences, communication, engineering, fuzzy aircraft control, medicine, management and decision making, computer science, pattern recognition and clustering. These wide applications motivate us to consider divergence measures for fuzzy set theory in order to minimize, maximize or otherwise optimize fuzziness.

Let A = {x_{i}: μ_{A}(x_{i}), ∀i = 1, 2, ..., n} and B = {x_{i}: μ_{B}(x_{i}), ∀i = 1, 2, ..., n}, where 0 < μ_{A}(x_{i}) < 1 and 0 < μ_{B}(x_{i}) < 1, be two fuzzy sets. The fuzzy divergence corresponding to Kullback and Leibler (1951) has been defined by Bhandari and Pal (1993) as:

I (A||B) = Σ_{i=1}^{n} [μ_{A}(x_{i}) log (μ_{A}(x_{i})/μ_{B}(x_{i})) + (1-μ_{A}(x_{i})) log ((1-μ_{A}(x_{i}))/(1-μ_{B}(x_{i})))]   (7)
The fundamental properties of fuzzy divergence are as follows:

• Non-negativity, i.e., D (A||B) ≥ 0
• D (A||B) = 0, if A = B
• D (A||B) is a convex function in (0, 1)
• D (A||B) should not change when μ_{A}(x_{i}) is changed to 1-μ_{A}(x_{i}) and μ_{B}(x_{i}) to 1-μ_{B}(x_{i})

Bhandari and Pal (1993) have established some properties such as:

• D (A||B) = I (A||B) + I (B||A)
• D (A∪B||A∩B) = D (A||B)
• D (A∪B||C) ≤ D (A||C) + D (B||C)
• D (A||B) ≥ D (A∪B||A)
• D (A||B) is maximum if B is the farthest non-fuzzy set of A
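Some of these properties can be checked numerically (a sketch; here D is the symmetric divergence I (A||B) + I (B||A) from the first property, with fuzzy union and intersection taken as the pointwise max and min):

```python
import math

def fuzzy_i(a, b):
    """Directed fuzzy divergence I(A||B) in the Bhandari-Pal form."""
    return sum(ai * math.log(ai / bi) + (1 - ai) * math.log((1 - ai) / (1 - bi))
               for ai, bi in zip(a, b))

def fuzzy_d(a, b):
    """Symmetric fuzzy divergence D(A||B) = I(A||B) + I(B||A)."""
    return fuzzy_i(a, b) + fuzzy_i(b, a)

mu_a = [0.2, 0.7, 0.5, 0.9]
mu_b = [0.4, 0.6, 0.3, 0.8]

union = [max(x, y) for x, y in zip(mu_a, mu_b)]          # A ∪ B
intersection = [min(x, y) for x, y in zip(mu_a, mu_b)]   # A ∩ B

# Property: D(A∪B || A∩B) = D(A||B)
assert abs(fuzzy_d(union, intersection) - fuzzy_d(mu_a, mu_b)) < 1e-12

# Property: D is unchanged under complementation mu -> 1 - mu.
comp_a = [1 - x for x in mu_a]
comp_b = [1 - x for x in mu_b]
assert abs(fuzzy_d(comp_a, comp_b) - fuzzy_d(mu_a, mu_b)) < 1e-12
```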

Havrda and Charvat (1967) gave the measure of directed divergence as:

D_{α} (P, Q) = (α-1)^{-1} [Σ_{i=1}^{n} p_{i}^{α} q_{i}^{1-α} - 1], α ≠ 1, α > 0   (8)
Corresponding to Eq. 8, the average code word length can be taken as:

Corresponding to Eq. 8, the fuzzy measure of directed divergence between two fuzzy sets A and B can be taken as:

and its corresponding fuzzy average codeword length as:

**Remark:**

• As α→1, Eq. 8 tends to Eq. 1
• As α→1 and q_{i} = 1, Eq. 8 tends to Eq. 3
• As α→1, Eq. 9 tends to the average codeword length given as:
• As α→1 and q_{i} = 1, Eq. 9 tends to the average codeword length corresponding to Shannon's entropy, given as:
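The first remark can be checked numerically, assuming the common normalization D_{α} (P, Q) = (α-1)^{-1} (Σ p_{i}^{α} q_{i}^{1-α} - 1) for the Havrda-Charvat measure (a sketch; the exact constant in Eq. 8 may differ):

```python
import math

def havrda_charvat(p, q, alpha):
    # D_alpha(P, Q) = (alpha-1)^(-1) * (sum p_i^alpha q_i^(1-alpha) - 1), alpha != 1
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (s - 1) / (alpha - 1)

def kl_divergence(p, q):
    # I(P, Q) = sum p_i log(p_i / q_i), natural log (Eq. 1)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

# As alpha -> 1, D_alpha tends to the Kullback-Leibler divergence;
# the error shrinks roughly linearly in (alpha - 1).
for alpha in (1.1, 1.01, 1.001):
    gap = abs(havrda_charvat(p, q, alpha) - kl_divergence(p, q))
    assert gap < 10 * (alpha - 1)
```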

**NOISELESS DIRECTED DIVERGENCE CODING THEOREMS**

**Theorem 1:** For all uniquely decipherable codes:

Where:

**Proof:** By Hölder's inequality, we have:

Set:

and:

Thus Eq. 13 becomes:

Using Kraft’s inequality, we have:

or:

or:

Dividing both sides by t, we get:

Subtracting n from both sides, we have:

Taking α = t+1, i.e., t = α-1, and:

Equation 15 becomes:

Dividing both sides by α, we get:

That is, D_{α} ≤ L_{α}, which proves the theorem.
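Kraft's inequality, invoked in the proof, states that the codeword lengths n_{i} of any uniquely decipherable code over an alphabet of size D satisfy Σ D^{-n_{i}} ≤ 1. A quick check for a binary prefix code (a sketch; the example code is our own):

```python
def kraft_sum(lengths, alphabet_size=2):
    """Sum of D^(-n_i) over the codeword lengths n_i."""
    return sum(alphabet_size ** -n for n in lengths)

# A binary prefix code: 0, 10, 110, 111 (uniquely decipherable).
lengths = [1, 2, 3, 3]
assert kraft_sum(lengths) <= 1  # Kraft's inequality holds

# Lengths violating the inequality cannot belong to any
# uniquely decipherable binary code.
assert kraft_sum([1, 1, 2]) > 1
```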

**Theorem 2:** For all uniquely decipherable codes:

Where:

where either α ≥ 1, β ≤ 1 or α ≤ 1, β ≥ 1.

**Proof:** From Eq. 16, we have:

Multiplying both sides by (α-1), we get:

Changing α to β, Eq. 20 becomes:

Subtracting Eq. 21 from Eq. 20, we have:

Dividing both sides by (β-α), we have:

That is, D_{α, β} ≤ L_{α, β}, which proves the theorem.

**Theorem 3:** For all uniquely decipherable codes:

where:

and:

To prove this theorem, we first prove the following lemma:

**Lemma 1:** For all uniquely decipherable codes:

**Proof of the lemma:** From (3), we have:

Subtracting ‘n’ from both sides, we have:

Taking α = t+1, i.e., t = α-1, and:

we have:

which proves the lemma.

**Proof of Theorem 3:** Changing α to β in Eq. 24, we get:

Subtracting Eq. 25 from Eq. 24, we have:

Dividing both sides by (β-α), we have:

That is, D΄_{α, β} ≤ L΄_{α, β}, which proves the theorem.