NEWCOMB-BENFORD’S LAW APPLICATIONS TO
ELECTORAL PROCESSES, BIOINFORMATICS, AND
              THE STOCK INDEX




                           By
                David David A. Torres Núñez




       SUBMITTED IN PARTIAL FULFILLMENT OF THE
           REQUIREMENTS FOR THE DEGREE OF
                   MASTER OF SCIENCE
                           AT
              UNIVERSITY OF PUERTO RICO
               RIO PIEDRAS, PUERTO RICO
                        MAY 2006




        © Copyright by David A. Torres Núñez, 2006
UNIVERSITY OF PUERTO RICO
                        DEPARTMENT OF
                          MATHEMATICS


      The undersigned hereby certify that they have read and
recommend to the Faculty of Graduate Studies for acceptance
a thesis entitled “Newcomb-Benford’s Law Applications to
Electoral Processes, Bioinformatics, and the Stock Index”
by David A. Torres Núñez in partial fulfillment of the requirements
for the degree of Master of Science.




                                                      Dated: May 2006




      Supervisor:
                                      Dr. Luis Raúl Pericchi Guerra




      Readers:
                                         Dr. María Eglée Pérez




                                            Dr. Dieter Reetz.




UNIVERSITY OF PUERTO RICO


                                                            Date: May 2006

Author:        David A. Torres Núñez

Title:         Newcomb-Benford’s Law Applications to Electoral
               Processes, Bioinformatics, and the Stock Index

Department: Mathematics
Degree: M.Sc.           Convocation: May             Year: 2006


        Permission is herewith granted to University of Puerto Rico to circulate
and to have copied for non-commercial purposes, at its discretion, the above
title upon the request of individuals or institutions.




                                              Signature of Author


      THE AUTHOR RESERVES OTHER PUBLICATION RIGHTS, AND
NEITHER THE THESIS NOR EXTENSIVE EXTRACTS FROM IT MAY
BE PRINTED OR OTHERWISE REPRODUCED WITHOUT THE AUTHOR’S
WRITTEN PERMISSION.
      THE AUTHOR ATTESTS THAT PERMISSION HAS BEEN OBTAINED
FOR THE USE OF ANY COPYRIGHTED MATERIAL APPEARING IN THIS
THESIS (OTHER THAN BRIEF EXCERPTS REQUIRING ONLY PROPER
ACKNOWLEDGEMENT IN SCHOLARLY WRITING) AND THAT ALL SUCH USE
IS CLEARLY ACKNOWLEDGED.




To my family, and the extended family that always kept faith in
                          me.




Table of Contents

Table of Contents                                                                     v

List of Tables                                                                       vii

List of Figures                                                                      ix

Abstract                                                                               i

Acknowledgements                                                                      ii

Introduction                                                                          1

1 Basic Notation and Derivations                                                      4
  1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    4
  1.2 Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     5
      1.2.1 A Differential Equation Approach. . . . . . . . . . . . . . . .            5
      1.2.2 The Float Point Notation Scheme. Knuth . . . . . . . . . . .              7
      1.2.3 In the Float Point Notation Scheme. Hamming . . . . . . . .               8
      1.2.4 The Brownian Model Scheme. Pietronero . . . . . . . . . . . .            10
  1.3 A Statistical Derivation of N-B L . . . . . . . . . . . . . . . . . . . .      11
      1.3.1 Mantissa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     12
      1.3.2 A Natural Probability Space . . . . . . . . . . . . . . . . . . .        15
      1.3.3 Mantissa σ-algebra Properties . . . . . . . . . . . . . . . . . .        15
      1.3.4 Scale and Base Invariance . . . . . . . . . . . . . . . . . . . .        17
  1.4 Mean and Variance of the D_b^k . . . . . . . . . . . . . . . . . . . .          23
  1.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   24
      1.5.1 Generating r Significant Digit’s Distribution Base b. . . . . .           24
      1.5.2 Effects of Bounds in the Newcomb-Benford Generated Values.                25



2 Empirical Analysis                                                                                  30
  2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    30
  2.2 Changing P-Values in Null Hypothesis Probabilities H0 . . . . . . . .       31
      2.2.1 Posterior Probabilities with Uniform Priors . . . . . . . . . . .     33
  2.3 Multinomial Model Proposal . . . . . . . . . . . . . . . . . . . . . .      36
  2.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    37
  2.5 Conclusions of the examples . . . . . . . . . . . . . . . . . . . . . .     41

3 Stock Indexes’ Digits                                                                               44
  3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    44
  3.2 Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    44
  3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                   46

4 On Image Analysis in the Microarray Intensity Spot                                                  49
  4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    49
  4.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    50
      4.2.1 Microarray measurements and image processing . . . . . . . . .        52
  4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   53

5 Electoral Process on a Newcomb Benford Law Context.                                                 57
  5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    57
  5.2 General Democratic Election Model . . . . . . . . . . . . . . . . . .       58
  5.3 Empirical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    59
  5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   72

6 Appendix: MATLAB PROGRAMS.                                                                          75
  6.1 Matlab Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                      75




List of Tables

 1     Newcomb Benford Law for the First Significant Digit . . . . . . . . .              2

 1.1   Mean, Variance, Standard Deviation and Variation Coefficient for the
       First and Second Significant Digit Distributions. . . . . . . . . . . . .         24

 2.1   p-values in terms of hypothesis probabilities. . . . . . . . . . . . . .   32
 2.2   Summary of the results of the above examples. . . . . . . . . . . . . .          41

 3.1   N-Benford’s for 1st and 2nd digit: p-values, Probability Null Bound
       and Approximate probability for the different increments . . . . . . .     47
 3.2   N-Benford’s for 1st and 2nd digit: The probability of the null hypothesis
       given the data and the length of the data. . . . . . . . . . . . . . . .         47

 4.1   N-Benford’s for 1st and 2nd digit: P(H0 | data), P(Approx), P(Frac)
       and Pr(BIC). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   54
 4.2   N-Benford’s for 1st and 2nd digit; the number of observations, p-values. 55

 5.1   The second digit proportions analysis of the winner for the set of his-
       torical elections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   59
 5.2   The second digit proportions analysis of the loser for the set of histor-
       ical elections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    60
 5.3   The first digit proportions of the distance between the winner and the
       loser for the set of historical elections. . . . . . . . . . . . . . . . . . .   60




5.4   The second digit proportions of the distance between the winner and
      the loser for the set of historical elections. . . . . . . . . . . . . . . .       61
5.5   The second digit proportions of the sum between the winner and the
      loser for the set of historical elections. . . . . . . . . . . . . . . . . . .     61
5.6   The Newcomb Benford’s for 1st and 2nd digit for the United States
      Presidential Elections 2004. Note how close the values of the posterior
      probability given the data are to 1.0. . . . . . . . . . . . . . . . . .    62
5.7   The second digit proportions analysis of the winner for the set of
      historical elections. Number of observed values, p-value and probability
      null bound is shown. . . . . . . . . . . . . . . . . . . . . . . . . . .    62
5.8   The second digit proportions analysis of the loser for the set of
      historical elections. Number of observed values, p-value and probability
      null bound is shown. Note that p-values should be smaller than 1/e
      for the bound to be valid. . . . . . . . . . . . . . . . . . . . . . . .    63
5.9   The first digit proportions of the distance between the winner and the
      loser for the set of historical elections. Number of observed values,
      p-value and probability null bound is shown. Note that p-values should
      be smaller than 1/e for the bound to be valid. . . . . . . . . . . . . .    63
5.10 The second digit proportions of the distance between the winner and
      the loser for the set of historical elections. Number of observed values,
      p-value and probability null bound is shown. Note that p-values should
      be smaller than 1/e for the bound to be valid. . . . . . . . . . . . . .    64
5.11 The second digit proportions of the sum between the winner and the
      loser for the set of historical elections. Number of observed values,
      p-value and probability null bound is shown. Note that p-values should
      be smaller than 1/e for the bound to be valid. . . . . . . . . . . . . .    64




List of Figures

 1     Newcomb-Benford Law theoretical frequencies for the first and second sig-
       nificant digit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    3

 1.1   Constrained Newcomb Benford Law compared with a restricted bound of
       digits K ≤ 99, from numbers between 1 and 99. Here there is no restriction. 28
 1.2   Constrained Newcomb Benford Law compared with a restricted bound of
       digits K ≤ 50, from numbers between 1 and 99. . . . . . . . . . . . .      28
 1.3   Constrained Newcomb Benford Law compared with a restricted bound of
       digits K ≤ 20, from numbers between 1 and 99. . . . . . . . . . . . .      29

 2.1   Presenting the posterior intervals for the first digit using a symmetric
       boxplot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   38
 2.2   Newcomb-Benford Law theoretical frequencies for the first significant digit.       42
 2.3   Newcomb-Benford Law theoretical frequencies for the first significant digit.
       This represents the example 1 simulation results. . . . . . . . . . . . .  42
 2.4   Newcomb-Benford Law theoretical frequencies for the first significant digit.
       This represents the example 2 simulation results. . . . . . . . . . . . .  43
 2.5   Newcomb-Benford Law theoretical frequencies for the first significant digit.
       This represents the multinomial example simulation results. . . . . . . .  43

 4.1   Histograms of the Intensities and the Adjustments. . . . . . . . . . . . .       55
 4.2   N-Benford’s Law compared with intensity microarray spots without
       adjustment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  56


4.3   N-Benford’s Law compared with intensity microarray spots with
      adjustment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   56

5.1   Presidential election analysis using electoral college votes compared
      with N-B Law for the 1st digit. . . . . . . . . . . . . . . . . . . .       65
5.2   Presidential election analysis using electoral college votes compared
      with N-B Law for the 2nd digit. . . . . . . . . . . . . . . . . . . .       66
5.3   Puerto Rico 1996 Elections compared with the Newcomb Benford Law
      for the second digit. . . . . . . . . . . . . . . . . . . . . . . . . .     67
5.4   Puerto Rico 2000 Elections compared with the Newcomb Benford Law
      for the second digit. . . . . . . . . . . . . . . . . . . . . . . . . .     68
5.5   Puerto Rico 2004 Elections compared with the Newcomb Benford Law
      for the first digit. . . . . . . . . . . . . . . . . . . . . . . . . . .    69
5.6   Venezuela Revocatory Referendum Manual Votes Proportions com-
      pared with the Newcomb Benford Law’s proportions for the Second
      Digit.   . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    70
5.7   Venezuela Revocatory Referendum Manual Votes Proportions compared
      with the Newcomb Benford Law’s proportions for the second digit.            71
5.8   Venezuela Revocatory Referendum Electronic and Manual Votes
      Proportions compared with the Newcomb Benford Law’s proportions for
      the second digit. . . . . . . . . . . . . . . . . . . . . . . . . . . .     73
5.9   Venezuela Revocatory Referendum Manual Distance between the winner
      and loser Proportions compared with the Newcomb Benford Law’s
      proportions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    74




Abstract

Since this rather amazing fact was discovered in 1881 by the American astronomer
Newcomb (1881), many scientists have been searching for members of the outlaw
number family. Newcomb noticed that the pages of logarithm books containing
numbers starting with 1 were much more worn than the other pages. After analyzing
several sets of naturally occurring data, Newcomb went on to derive what later
became Benford’s law. As a tribute to Newcomb, we call this phenomenon the
Newcomb-Benford Law.
   We start by establishing a connection between the Microarray and Stock Index
data sets, which can be seen as an extension of the work done by Hoyle David C.
(2002) and Ley (1996). Most of the analysis has been made using Classical and
Bayesian statistics. We explain the differences between the different approaches to
hypothesis testing between models, following Berger J.O. and Pericchi L. R. (2001).
Finally, we apply these concepts to the different types of data, including Microarray,
Stock Index and Electoral Process data.
   There are several results on constrained data; the most relevant is the Constrained
Newcomb Benford Law, to which most of the Bayesian analysis covered here is
applied.




Acknowledgements

I wish to express my gratitude to everyone who contributed to making this work
possible. I would like to thank God first, and also Dr. L. R. Pericchi, my supervisor,
for his many suggestions and constant support during this research. I am also
thankful to the whole Mathematics faculty for their guidance through the early years
of chaos and confusion.
Doctor Pericchi expressed his interest in my work and supplied me with preprints of
some of his recent joint work with Berger J. O., which gave me a better perspective
on the results. L. R. Pericchi, thank you for being more than a supervisor: a father
and my friend. I am indebted to Dr. María Eglée Pérez, Prof. Z. Rodriguez, and
Humberto Ortiz Zuazaga, for providing data and insights during the drafting process.
I would also like to thank my parents for providing me with the opportunity to be
where I am. Without them, none of this would even be possible. You have always
been my biggest fans and I appreciate that. To my father: thanks for the support,
even if you are not here anymore. To my mother: you are my hero, always.
I would also like to thank my special friends, because you have been my biggest
critics throughout my entire personal life and professional career. Your
encouragement, input and constructive criticism have been priceless. For that, thanks
to Ricardo Ortiz, Ariel Cintrón, Antonio Gonzales, Tania Yuisa Arroyo, Erika
Migliza, Wally Rivera, Raquel Torres, Dr. Pedro Rodriguez Esquerdo, Dr. Punchin,
Lourdes Vazquez (sister), Chungseng Yang (brother), Lihai Song and all the extended
family.
I would like to thank my soulmate, Ana Tereza Rodriguez, for keeping me grounded
and for providing me with some memorable experiences.

Rio Piedras, Puerto Rico                                         David Torres Núñez
May 15, 2006



Introduction

The first known person to explain the anomalous distribution of the digits was the
astronomer and mathematician Simon Newcomb, writing in the American Journal of
Mathematics. He stated:


     ”The law of probability of the occurrence of numbers is such that all
     mantissae of their logarithms are equally probable.”

Since then, many mathematicians have been enthusiastic about finding sets of data
suited to this phenomenon. There has been a century of theorems, definitions,
conjectures and discoveries around the first digit phenomenon. The discovery of this
fact goes back to 1881, when the American astronomer Simon Newcomb noticed that
the first pages of logarithm books (used at that time to perform calculations), the
ones containing numbers that started with 1, were much more worn than the other
pages. However, it has been argued that any book that is used from the beginning
would show more wear and tear on the earlier pages. This story might thus be
apocryphal, just like Isaac Newton’s supposed discovery of gravity from observation
of a falling apple. The phenomenon was rediscovered in 1938 by the physicist
Benford (1938), who checked it on a wide variety of data sets and was credited for it.
In 1996, Hill (1996) proved the result about random mixtures of distributions of
random variables, which




           Table 1: Newcomb Benford Law for the First Significant Digit

        Digit unit    1    2    3    4    5    6    7    8    9
        Probability   .301 .176 .125 .097 .079 .067 .058 .051 .046
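
The probabilities in Table 1 follow directly from the logarithmic law derived in
Chapter 1, P(d) = log10(1 + 1/d). A minimal Python sketch reproducing the table
(the thesis’s own appendix uses MATLAB; this translation is illustrative only):

```python
import math

# First-significant-digit probabilities under the Newcomb-Benford Law:
# P(d) = log10(1 + 1/d) for d = 1, ..., 9.
nbl_first = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in nbl_first.items():
    print(d, round(p, 3))

# The nine probabilities sum to 1, since the sum telescopes to log10(10).
assert abs(sum(nbl_first.values()) - 1.0) < 1e-9
```

Rounding each value to three decimals recovers the row of Table 1 exactly.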



generalizes the Law for dimensionless quantities. Some mathematical series and other

data sets that satisfy Newcomb Benford’s Law are:

   • Prime numbers

   • Series distributions

   • Fibonacci numbers

   • Factorial numbers

   • Sequences of powers and numbers in Pascal’s Triangle

   • Demographic statistics

   • Other social science data; numbers that appear in magazines and newspapers


   Intuitively, most people assume that, in a string of numbers sampled randomly

from some body of data, the first non-zero digit could be any number from 1 through

9, with all nine digits regarded as equally probable. As the figure below shows, the

equal-probability assumption is very different from the Newcomb Benford distribution.

As an example, for the first digit we have the following discrete probability

distribution function.
[Figure: Newcomb-Benford’s Law first- and second-digit probabilities compared with
the uniform lines y = 1/9 and y = 1/10.]
Figure 1: Newcomb-Benford Law theoretical frequencies for the first and second significant
digit.

    We will present two different situations with different derivations. The first covers
data with units, like dollars or meters. The second involves unit-free data, like counts
of votes. Applications presented here include:

   1. Puerto Rico’s Stock Index.

   2. Microarray data in Bioinformatics.

   3. Voting counts in Venezuela, the United States and Puerto Rico.

The potential uses are detection of fraud, detection of corruption of data, or detection
of a lack of proper scaling. We analyze the statistical properties of the Benford
distribution and illustrate Benford’s law with many data sets of both theoretical and
real-life origins. Applications of Benford’s Law, such as detecting fraud, are
summarized and explored to reveal the power of this mathematical principle.
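
One common way the fraud-detection idea is operationalized is a goodness-of-fit
statistic of observed first-digit counts against the NBL proportions. The thesis’s own
analyses use p-values and Bayesian posterior probabilities; the plain Pearson
chi-square below is only an illustrative Python sketch, and `benford_chi2` is a
hypothetical helper name, not from the thesis:

```python
import math

def benford_chi2(counts):
    """Pearson chi-square statistic comparing observed first-digit
    counts (counts[0] is the count for digit 1) with the expected
    Newcomb-Benford counts."""
    n = sum(counts)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts[d - 1] - expected) ** 2 / expected
    return stat

# Counts drawn in the NBL proportions give a small statistic;
# uniform counts of the same total size give a large one.
print(benford_chi2([301, 176, 125, 97, 79, 67, 58, 51, 46]))
print(benford_chi2([111] * 9))
```

A large statistic flags data whose leading digits deviate from the law, which is the
signal exploited in fraud detection.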

Most of this work has been typeset in LaTeX Lamport (1986) and Knuth (1984).
Chapter 1

Basic Notation and Derivations


1.1     Introduction

The data sets of the family of outlaw numbers come in two different kinds. The first

is the type of numbers that have units, like money. The other type is numbers that

do not have units, like votes; this last type of data set can be found in electoral

processes and mathematical series. In this chapter we introduce some basic concepts

and notation consistent with Hill (1996). This formulation helps to understand in a

deeper way the probabilistic basis of the Newcomb Benford Law. There are slightly

different derivations; most of them are not as statistically general as the one presented

by Hill. Another example that will appear in this discussion is base invariance, similar

to the one presented by L. Pietronero (2001). The aim of this work is to generalize

the Newcomb Benford Law in order to apply it to wider classes of data sets, and to

verify its fit to different sets of data with modern Bayesian statistical methods.







1.2     Derivations

We present here some derivations; first of all we use a heuristic approach based on

invariance. In this first section, Benford’s law applies to data that are not

dimensionless, so the numerical values of the data depend on the units.


1.2.1    A Differential Equation Approach.

If there exists a universal probability distribution P(x) over such numbers, then it

must be invariant under a change of scale Hill (1995a), so



                                 P (kx) = f (k)P (x)                           (1.2.1)

   Integrating with respect to x we have

\[
\int P(kx)\,dx = f(k)\int P(x)\,dx .
\]

   If \(\int P(x)\,dx = 1\) and \(f(k) \neq 0\), then \(\int P(kx)\,dx = f(k)\). On the
other hand, taking \(y = kx\) so that \(dy = k\,dx\), normalization gives

\[
\int P(kx)\,dx = \frac{1}{k}\int P(y)\,dy = \frac{1}{k},
\]

and hence \(f(k) = 1/k\).



   Taking derivatives with respect to k,

\[
\frac{\partial P(kx)}{\partial k} = x\,P'(kx) = P(x)\,f'(k) = -\frac{P(x)}{k^{2}} .
\]

Setting \(k = 1\) gives

\[
x\,P'(x) = -P(x). \tag{1.2.2}
\]

This equation has solution \(P(x) = 1/x\): indeed \(x\,(1/x)' = -x/x^{2} = -1/x = -P(x)\).
Although this is not a proper probability distribution (since its integral diverges),
both the laws of physics and human convention impose cutoffs. If many powers of 10
lie between the cutoffs, then the probability that the first (decimal) digit is D is
given by the logarithmic distribution




\[
P_D \;=\; \frac{\int_{D}^{D+1} P(x)\,dx}{\int_{1}^{10} P(x)\,dx}
    \;=\; \frac{\int_{D}^{D+1} \frac{1}{x}\,dx}{\int_{1}^{10} \frac{1}{x}\,dx}
    \;=\; \frac{\log_{10} x \,\big|_{D}^{D+1}}{\log_{10} x \,\big|_{1}^{10}}
    \;=\; \log_{10}\!\Big(1 + \frac{1}{D}\Big).
\]

The last expression is called the Newcomb-Benford Law (NBL).
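
The ratio of integrals above can be checked numerically against the closed form
log10(1 + 1/D). A hedged Python sketch (the midpoint quadrature and the step count
are illustrative choices, not from the thesis):

```python
import math

def integrate_reciprocal(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of 1/x on [a, b]."""
    h = (b - a) / steps
    return sum(h / (a + (i + 0.5) * h) for i in range(steps))

for D in range(1, 10):
    # P_D = (integral of 1/x from D to D+1) / (integral of 1/x from 1 to 10)
    p_numeric = integrate_reciprocal(D, D + 1) / integrate_reciprocal(1, 10)
    p_closed = math.log10(1 + 1 / D)
    assert abs(p_numeric - p_closed) < 1e-6
```

The agreement for every digit D confirms the logarithmic distribution derived above.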

   However, Benford’s law applies not only to scale-invariant data, but also to num-

bers chosen from a variety of different sources. Explaining this fact requires a more

rigorous investigation of central limit-like theorems for the mantissas of random vari-

ables under multiplication. As the number of variables increases, the density function
approaches that of a logarithmic distribution. Hill (1996) rigorously demonstrated

that the ”mixture of distributions” given by random samples taken from a variety of

different distributions follows, in fact, Newcomb Benford’s law. Here we present

those results that explain the properties of the NBL.


1.2.2    The Float Point Notation Scheme. Knuth

There are conditions for the leading digit Knuth (1981). He noticed that, in order to
account for the leading digit law, it is important to observe the way numbers are
written in floating point notation. As suggested, the leading digit of u is determined
by log(u) mod 1. The operator r mod 1 represents the fractional part of the number
r, and fu is the normalized fraction part of u. Let u be a non-negative number. Note



that the leading digit of u is less than d if and only if


                                (log10 u)mod1 < log10 d                            (1.2.3)


since \(10^{f_u} = 10^{(\log_{10} u) \bmod 1}\). Taking a random number W from a random

distribution that may occur in Nature, following Knuth, we may expect that

\((\log_{10} W) \bmod 1 \sim \mathrm{Unif}(0, 1)\), at least to a very good approximation.

Similarly, any transformation of W is expected to be distributed in the same manner.

Therefore, by (1.2.3), the leading digit will be 1 with probability

\(\log_{10}(1 + \frac{1}{1}) \approx 30.103\%\); it will be 2 with probability

\(\log_{10} 3 - \log_{10} 2 \approx 17.609\%\); and in general, if r is any real value in

[1, 10), we ought to have \(10^{f_u} \le r\) approximately \(\log_{10} r\) of the time.

This gives a rough picture of why the leading digits behave the way they do.
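
Knuth’s observation can be checked empirically: multiplying several random factors
makes (log10 of the product) mod 1 nearly uniform, so the leading digits of the
product approach the NBL frequencies. A small Python simulation (the seed, the
factor distribution Unif(0.5, 50), the number of factors, and the sample size are all
illustrative assumptions):

```python
import math
import random

random.seed(1)
N = 100_000
counts = [0] * 10  # index = leading digit

for _ in range(N):
    # Product of several independent positive factors: the fractional
    # part of log10 of the product is close to uniform on [0, 1).
    x = 1.0
    for _ in range(8):
        x *= random.uniform(0.5, 50.0)
    frac = math.log10(x) % 1.0
    counts[int(10 ** frac)] += 1  # leading digit is floor(10^frac)

for d in range(1, 10):
    print(d, counts[d] / N, round(math.log10(1 + 1 / d), 3))
```

The empirical frequencies track the NBL column closely, with digit 1 appearing about
six times as often as digit 9.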


1.2.3     In the Float Point Notation Scheme. Hamming

Another approach was suggested by Hamming (1970). Let p(r) be the probability
that \(10^{f_U} \le r\), where r lies between 1 and 10 (1 ≤ r ≤ 10) and \(f_U\) is the
normalized fraction part of a random normalized floating point number U. Taking into
account that this distribution is base invariant, suppose that every constant of our
universe of random floating point numbers is multiplied by a constant factor c; this
will not affect p(r). When we multiply, there is a transformation from
\((\log_{10} U) \bmod 1\) to \((\log_{10} U + \log_{10} c) \bmod 1\). Let \(\Pr(\cdot)\) be
the usual probability function. Then, by definition,

\[
p(r) = \Pr\big((\log_{10} U) \bmod 1 \le (\log_{10} r) \bmod 1\big).
\]

On the assumptions of (1.2.3), it follows that




\[
p(r) = \Pr\big((\log_{10} U - \log_{10} c) \bmod 1 \le \log_{10} r\big)
\]
\[
= \begin{cases}
\Pr\big(\log_{10} U \bmod 1 \le \log_{10} r - \log_{10} c\big), & \text{if } c \le r;\\[2pt]
\Pr\big(\log_{10} U \bmod 1 \le \log_{10} r + 1 - \log_{10} c\big), & \text{if } c \ge r;
\end{cases}
\]
\[
= \begin{cases}
p(r/c) + 1 - p(10/c), & \text{if } c \le r;\\[2pt]
p(10r/c) + 1 - p(10/c), & \text{if } c \ge r.
\end{cases}
\]


   Until now the values of r have been confined to the closed interval [1, 10]. To be methodical it is important to extend r to values outside this interval; for this purpose define p(10^n r) = p(r) + n for every positive integer n. If we substitute d = 10/c, the second case above can be written as:

                                     p(rd) = p(r) + p(d)                          (1.2.4)

Under the hypothesis that the distribution is invariant under multiplication by a constant, (1.2.4) will be true for all r > 0 and d ∈ [1, 10]. Since p(1) = 0 and p(10) = 1, then


                     1 = p(10)
                       = p((10^{1/n})^n)
                       = p(10^{1/n}) + p((10^{1/n})^{n−1})
                       = p(10^{1/n}) + p(10^{1/n}) + p((10^{1/n})^{n−2})
                                  .
                                  .
                                  .
                       = n p(10^{1/n})


                                   √             m
hence, from the above, p(10^{m/n}) = m/n for all positive integers m and n. If the continuity of p is assumed, it is then required that

                                    p(r) = log10 r.                                (1.2.5)

Knuth suggested that, to be more rigorous, it is important to assume that there is some underlying distribution of numbers F(u); then the desired probability will be

                           p(r) = Σ_m (F(10^m r) − F(10^m))

obtained by summing over −∞ < m < ∞. Then the invariance hypothesis and the continuity assumption lead to (1.2.5), which is the desired distribution.
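   The conclusion p(r) = log10 r can also be checked numerically. The following sketch (ours, not part of Hamming's original argument) assumes U is spread log-uniformly over several decades, estimates p(r) empirically, and verifies that it matches log10 r and is unchanged when every U is multiplied by a constant c:

```python
import math
import random

random.seed(1)

# Draw U log-uniform over many decades, so (log10 U) mod 1 is roughly uniform.
# Under this model p(r) = Pr(10^{f_U} <= r) should equal log10(r), and scaling
# every U by a constant c should leave p(r) unchanged.
def p_hat(values, r):
    """Empirical p(r): fraction of the sample whose mantissa 10^{f_U} is <= r."""
    return sum(1 for u in values if 10 ** (math.log10(u) % 1.0) <= r) / len(values)

sample = [10 ** random.uniform(-5, 5) for _ in range(100_000)]
for c in (1.0, 2.0, math.pi):          # arbitrary scale factors
    scaled = [c * u for u in sample]
    for r in (2.0, 5.0):
        assert abs(p_hat(scaled, r) - math.log10(r)) < 0.01
print("p(r) is close to log10 r and invariant under scaling")
```

The tolerance 0.01 is loose enough for the Monte Carlo noise of a sample of this size.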


1.2.4     The Brownian Model Scheme. Pietronero

From another point of view, note that this can be seen as a model for the multiplicative oscillations of the stock market or of many complex processes in nature. A Brownian model is acceptable for this type of "natural process" L. Pietronero (2001). Brownian motion can be seen as a natural event that involves a change in the position or location of something. They propose N(t + 1) = ξN(t), where ξ is a positive stochastic variable (positive just for simplicity). With a logarithmic transformation a Gaussian process can be found,

                          log(N(t + 1)) = log(ξ) + log(N(t))

If log(ξ) is considered as a stochastic variable, then log(N(t + 1)) is a Brownian motion. Then for t → ∞ the distribution of log(N) mod 1 tends to Unif(0, 1). Transforming the problem back to the original form;




                               P(log10 N) d(log10 N) = C (1/N) dN

   where C is the normalization factor. It is obtained that P(N) ∼ N^{−1}. This suggests that the distribution of N is the First Digit Law distribution. On the other hand, equation (1.2.5) can be recovered for any base b. The proposal states that for b > 1:

    Prob(n) = ∫_n^{n+1} C N^{−1} dN = (∫_n^{n+1} dN/N) / (∫_1^b dN/N) = log10((n+1)/n) / log10 b     (1.2.6)

Finally, using the change-of-base property of logarithms, we get

                                   Prob(n) = log_b(1 + 1/n)                        (1.2.7)

that is a generalization of the Newcomb-Benford's Law.
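   This multiplicative scheme is easy to check by simulation. The sketch below is ours; the Gaussian choice for log10 ξ is illustrative only, not prescribed by Pietronero et al. It iterates L(t) = log10 N(t) in log space (to avoid overflow) and compares the first-digit frequencies along the trajectory with the First Digit Law:

```python
import math
import random

random.seed(7)

# Multiplicative process N(t+1) = xi * N(t), tracked in log10 space:
# L(t) = log10 N(t) performs a random walk, and the first significant
# digit of N(t) is floor(10^(L mod 1)).
L = 0.0
counts = [0] * 10
for _ in range(200_000):
    L += random.gauss(0.0, 1.0)        # log10 of the shock xi (illustrative)
    counts[int(10 ** (L % 1.0))] += 1

total = sum(counts)
for d in range(1, 10):
    assert abs(counts[d] / total - math.log10(1 + 1 / d)) < 0.02
print("first-digit frequencies are close to log10(1 + 1/d)")
```

Consecutive digits along the walk are correlated, so the agreement is only up to Monte Carlo noise; a 2% tolerance is comfortable at this trajectory length.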



1.3     A Statistical Derivation of the N-B Law

Theodore Hill has given a more general argument for dimensionless data. He has explained his central-limit-like theorem for significant digits by saying:

Remark 1.3.1. "Roughly speaking, this law says that if probability distributions are selected at random and random samples are then taken from each of these distributions in any way so that the overall process is scale (or base) neutral, then the significant-digit frequencies of the combined sample will converge to the logarithmic distribution" Hill (1996)

   In order to understand this explanation, a brief introduction to measure theory is presented here. A fundamental concept in the development of the theory behind this family of "anomalous numbers" is the mantissa, which permits the isolation of the groups of significant digits.


     Let D_b^(1), D_b^(2), D_b^(3), . . . denote the significant digit functions (base b).

Example 1.3.1. As an example, note that D_10^(1)(25.4) = 2, D_10^(2)(25.4) = 5 and D_10^(3)(25.4) = 4.

     The exact laws as given by Newcomb (1881) in terms of the significant digits base 10 are:

                  Prob(D_10^(1) = d) = log10(1 + 1/d);   d = 1, 2, . . . , 9.            (1.3.1)

        Prob(D_10^(2) = d) = Σ_{k=1}^{9} log10(1 + 1/(10k + d));   d = 0, 1, . . . , 9.  (1.3.2)

These equations show a way to write the NBL in terms of the significant digits.
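Both formulas can be evaluated directly; a short sketch (ours):

```python
import math

# First-digit law (1.3.1): Prob(D1 = d) = log10(1 + 1/d), d = 1..9.
first = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Second-digit law (1.3.2): Prob(D2 = d) = sum over first digits k = 1..9
# of log10(1 + 1/(10k + d)), d = 0..9.
second = {d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
          for d in range(0, 10)}

assert abs(sum(first.values()) - 1.0) < 1e-12    # both are probability laws
assert abs(sum(second.values()) - 1.0) < 1e-12
print(round(first[1], 5), round(second[0], 5))   # 0.30103 0.11968
```

Note how much flatter the second-digit law is: the digit 0 already occurs with probability only about 0.12, close to the uniform 0.10.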


1.3.1      Mantissa

As we have mentioned, the mantissa is introduced here as a way to formalize writing numbers in terms of their digits. The aim of defining the mantissa was to put the NBL in a proper countably additive probability framework. Basically, the NBL is a statement in terms of the significant digit functions.

Definition 1.3.1. The mantissa (base 10) of a positive real number x is the unique number r in (1/10, 1] with x = r ∗ 10^n for some n ∈ Z.

   To become more familiar with the mantissa definition, consider scientific notation.

Definition 1.3.2. A number is in scientific notation if it is in the form:

                                  Mantissa ∗ 10^characteristic

where the mantissa (Latin for makeweight) must be a number from 1 up to (but not including) 10, and the characteristic is an integer indicating the number of places the decimal point moved.

   A more general definition of the mantissa can be presented, a generalization for any base b > 1, as follows:

Definition 1.3.3. For each integer b > 1, the (base b) mantissa function, Mb, is the function Mb : R+ → [1, b) such that Mb(x) = r, where r is the unique number in [1, b) with x = r ∗ b^n for some n ∈ Z. For E ⊆ [1, b), let

                            ⟨E⟩_b = M_b^{-1}(E) = ∪_{n∈Z} b^n E ⊆ R+

The (base b) mantissa σ-algebra Λ_b is the σ-algebra on R+ generated by Mb.

Example 1.3.2. Using the function Mb defined above, we can verify that 9 has the same mantissa function image for the different bases 10, 100 and 1000. For this note that M10(9) = 9, since 9 = r ∗ 10^n = 9 ∗ 10^0, with n = 0 and r = 9. The same holds for base b = 100: here n = 0 and r = 9 again.

   Moreover, note that for b = 2 we have M2(9) = 9/8 = 1.001 (base 2), since 9 = (9/8) ∗ 2^3; this is consistent with the definition, since 9/8 ∈ [1, 2).

Remark 1.3.2. Note that the mantissa function Mb assigns each x ∈ R+ a unique value; hence it is well defined.

   An observation is that if E = [1, b) then ⟨E⟩_b = M_b^{-1}(E) = ∪_{n∈Z} b^n E = R+, and ⟨{1}⟩_10 = {10^n : n ∈ Z}.



Lemma 1.3.3. For all b ∈ N − {1},

     (i) ⟨E⟩_b = ∪_{k=0}^{n−1} ⟨b^k E⟩_{b^n};

    (ii) Λ_b = {⟨E⟩_b : E ∈ B[1, b)};

   (iii) Λ_b ⊆ Λ_{b^n} ⊆ B for all n ∈ N;

   (iv) Λ_b is closed under scalar multiplication.

Proof. Part (i) of the lemma follows directly from the definition of ⟨·⟩_b; (ii) follows from the fact that if E is a Borel set in [1, b), then Λ_b is the collection of sets of the form ∪_{k=0}^{n−1} ⟨b^k E⟩_{b^n}, E ∈ B[1, b). Taking points (i) and (ii) together we get point (iii). The last point of the lemma follows from point (ii), since Λ_b is the σ-algebra generated by {D_b^(1), D_b^(2), D_b^(3), . . .}.

      For the more general case of the s-digit law we have:

                            Prob(mantissa ≤ t/10) = log10 t;   t ∈ [1, 10).          (1.3.3)

Since we can write (1.3.3) using the digits, we have:

Definition 1.3.4. (General Significant Digit Law) For all positive integers k, all d1 ∈ {1, 2, . . . , 9} and all dj ∈ {0, 1, . . . , 9}, j = 2, . . . , k,

    Prob(D_10^(1) = d1, D_10^(2) = d2, . . . , D_10^(k) = dk) = log10[1 + (Σ_{i=1}^{k} d_i ∗ 10^{k−i})^{−1}]   (1.3.4)
Corollary 1.3.4. The significant digits are dependent.



   This corollary can be proved by giving a counterexample, which is the way Hill worked it out. It is now important to state a natural probability space in which every detail of each Newcomb-Benford's Law scheme can be described in a proper form. At this point strong measure-theoretic tools are needed, such as the σ-fields generated by the set of the r significant digits.
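The dependence asserted by the corollary can also be checked numerically from (1.3.4); a small sketch (ours) compares P(D2 = 1) with P(D2 = 1 | D1 = 1):

```python
import math

# Joint two-digit law from (1.3.4): P(D1 = a, D2 = b) = log10(1 + 1/(10a + b)).
def joint(a, b):
    return math.log10(1 + 1 / (10 * a + b))

p_d1_1 = math.log10(1 + 1 / 1)                       # P(D1 = 1), first digit law
p_d2_1 = sum(joint(k, 1) for k in range(1, 10))      # P(D2 = 1), marginal as in (1.3.2)
p_d2_1_given_d1_1 = joint(1, 1) / p_d1_1             # conditional probability

# If the digits were independent these two numbers would coincide; they do not.
assert abs(p_d2_1 - p_d2_1_given_d1_1) > 0.01
print(round(p_d2_1, 5), round(p_d2_1_given_d1_1, 5))
```

Numerically P(D2 = 1) ≈ 0.114 while P(D2 = 1 | D1 = 1) ≈ 0.126, which is exactly the kind of counterexample the corollary needs.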


1.3.2     A Natural Probability Space

Let the sample space R+ be the set of positive real numbers, and let the sigma algebra of events simply be the σ-field generated by {D_10^(1), D_10^(2), D_10^(3), . . .} or, equivalently, generated by the mantissa function x → mantissa(x). This σ-algebra is denoted by Λ and will be called the decimal mantissa σ-algebra. It is a sub-σ-field of the Borel sets, and

                              S ∈ Λ ⇔ S = ∪_{n=−∞}^{∞} B ∗ 10^n                     (1.3.5)

for some Borel B ⊆ [1, 10); this is the generalization of D1 = ∪_{n=−∞}^{∞} [1, 2) ∗ 10^n, the set of positive numbers whose first digit is 1.




1.3.3     Mantissa σ-algebra Properties

The mantissa σ-algebra has several properties:

1. Every nonempty set in Λ is infinite, with accumulation points at 0 and at +∞.

2. Λ is closed under scalar multiplication.

3. Λ is closed under integral roots, but not under powers.

4. Λ is self-similar in the sense that if S ∈ Λ, then 10^m ∗ S = S for every integer m.

      Here aS and S^a represent respectively {as : s ∈ S} and {s^a : s ∈ S}. The first property implies that finite intervals are not included in Λ: they are not expressible in terms of the significant digits. Note that significant digits alone cannot distinguish between the numbers 10 and 100, and thus the countable additivity contradiction associated with scale invariance disappears. Properties 1, 2 and 4 follow directly from (1.3.5), but the closure under integral roots needs more details: the square root of a set in Λ may need two parts, and similarly for higher roots.

Example 1.3.5. If

                             S = {D1 = 1} = ∪_{n=−∞}^{∞} [1, 2) ∗ 10^n,

then

               S^{1/2} = ∪_{n=−∞}^{∞} [1, √2) ∗ 10^n ∪ ∪_{n=−∞}^{∞} [√10, √20) ∗ 10^n ∈ Λ

but

                                 S^2 = ∪_{n=−∞}^{∞} [1, 4) ∗ 10^{2n} ∉ Λ,

since it has gaps (which are too large) and thus cannot be written in terms of the digits.

      Just as property 2 is the key to the hypothesis of scale invariance, property 4 is the key for base invariance as well.



1.3.4      Scale and Base Invariance

The mantissa σ-algebra Λ represents a proper measurability structure. In order to be rigorous, it is time to state a proper definition of a scale invariant measure.

Definition 1.3.5. A probability measure P on (R+, Λ) is scale invariant if P(S) = P(sS) for all s > 0 and all S ∈ Λ.

   The N-B Law (1.3.3), (1.3.4) is characterized by the scale invariance property.

Theorem 1.3.6. (Hill, 1995a) A probability measure P on (R+, Λ) is scale invariant if and only if

                                P(∪_{n=−∞}^{∞} [1, t) ∗ 10^n) = log10 t              (1.3.6)

for all t ∈ [1, 10).

Definition 1.3.6. A probability measure P on (R+, Λ) is base invariant if P(S) = P(S^{1/n}) for all positive integers n and all S ∈ Λ.

   Observe that for an integer t ∈ [1, 10) the set of numbers

                       S_t = {D1 = t, Dj = 0 for all j > 1}
                           = {. . . , 0.0t, 0.t, t, t0, t00, . . .}
                           = {t ∗ 10^n : n ∈ Z}

has, by (1.3.5), no nonempty proper Λ-measurable subsets. Recall the definition of a Dirac measure:

Definition 1.3.7. The Dirac measure δt associated to a point t ∈ R+ is defined as follows: δt(S) = 1 if t ∈ S and δt(S) = 0 if t ∉ S.



   Using the above definition and letting PL denote the logarithmic probability dis-

tribution on (R+ , Λ) given in 1.3.3, a complete characterization for base-invariant

significant- digit probability measures can now be given.

Theorem 1.3.7. (Hill, 1995a) A probability measure P on (R+, Λ) is base invariant if and only if

                                  P = qPL + (1 − q)δ1

for some q ∈ [0, 1].

   Note that P is a convex combination of the two measures PL and δ1. Using Theorems 1.3.6 and 1.3.7, T. Hill states that scale invariance implies base invariance, but not conversely; this is because δ1 is base invariant but not scale invariant. The proofs of those theorems are not reproduced here, but the statements are important in order to summarize the statistical derivation presented by T. Hill. Recall that a (real Borel) random probability measure (r.p.m.) M is a random vector (on an underlying probability space (Ω, F, P)) taking values which are Borel probability measures on R, and which is regular in the sense that for each Borel set B ⊆ R, M(B) is a random variable.

Definition 1.3.8. The expected distribution measure of an r.p.m. M is the probability measure EM (on the Borel subsets of R) defined by

                       (EM)(B) = E(M(B)) for all Borel B ⊆ R                    (1.3.7)

(where here and throughout, E(·) denotes expectation with respect to P on the underlying probability space).

   The next definition plays a central role in this section, and formalizes the concept of the following natural process which mimics Benford's data-collection procedure:



pick a distribution at random and take a sample of size k from this distribution; then

pick a second distribution at random, and take a sample of size k from this second

distribution, and so forth.

Definition 1.3.9. For an r.p.m. M and a positive integer k, a sequence of M-random k-samples is a sequence of random variables X1, X2, . . . on (Ω, F, P) such that for some i.i.d. sequence M1, M2, . . . of r.p.m.'s with the same distribution as M, and for each j = 1, 2, . . ., given Mj = P, the random variables X(j−1)k+1, . . . , Xjk are i.i.d. with d.f. P; and X(j−1)k+1, . . . , Xjk are independent of {Mi, X(i−1)k+1, . . . , Xik} for all i ≠ j.

   The following lemma shows the somewhat curious structure of such sequences.

Lemma 1.3.8. (Hill, 1995a) Let X1, X2, . . . be a sequence of M-random k-samples for some k and some r.p.m. M. Then

 (i) the Xn are a.s. identically distributed with distribution EM, but are not in general independent, and

(ii) given M1, M2, . . ., the Xn are a.s. independent, but are not in general identically distributed.

   As Hill states in his paper:

Remark 1.3.3. In general, sequences of M-random k-samples are not independent, not exchangeable, not Markov, not martingale, and not stationary sequences.

Example 1.3.9. Let M be a random measure which is the Dirac probability measure δ(1) at 1 with probability 1/2, and which is (δ(1) + δ(2))/2 otherwise, and let k = 3. Then each Mj will be δ(1) with probability 1/2 and (δ(1) + δ(2))/2 otherwise.

 (i) Since

          P(X2 = 2) = P(X2 = 2 | M1 = δ(1))P(M1 = δ(1)) + P(X2 = 2 | M1 = mix)P(M1 = mix) = 0 + (1/2)(1/2) = 1/4,

      but P(X2 = 2 | X1 = 2) = 1/2, X1 and X2 are not independent;

 (ii) since P((X1, X2, X3, X4) = (1, 1, 1, 2)) = 9/64 > 3/64 = P((X1, X2, X3, X4) = (2, 1, 1, 1)), the (Xn) are not exchangeable;

(iii) since

                    P(X3 = 1 | X1 = X2 = 1) = 9/10 > 5/6 = P(X3 = 1 | X2 = 1),

      the (Xn) are not Markov;

(iv) since

                                      E(X2 | X1 = 2) = 3/2 ≠ 2,

      the (Xn) are not a martingale;

 (v) and since

              P((X1, X2, X3) = (1, 1, 1)) = 9/16 > 15/32 = P((X2, X3, X4) = (1, 1, 1)),

      the (Xn) are not stationary.
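   A quick Monte Carlo sketch of this example (ours, illustrative only): draw each Mj, then a block of k = 3 samples from it, and estimate the two probabilities in part (i).

```python
import random

random.seed(3)

# M-random 3-samples for the random measure of Example 1.3.9:
# M_j = delta(1) with probability 1/2, else the fair mix of delta(1), delta(2).
def block():
    """One block X_{3j+1}, X_{3j+2}, X_{3j+3} drawn from a fresh M_j."""
    if random.random() < 0.5:
        return (1, 1, 1)                                   # M_j = delta(1)
    return tuple(random.choice((1, 2)) for _ in range(3))  # M_j = mix

blocks = [block() for _ in range(200_000)]
p_x2_eq_2 = sum(b[1] == 2 for b in blocks) / len(blocks)
x1_eq_2 = [b for b in blocks if b[0] == 2]
p_cond = sum(b[1] == 2 for b in x1_eq_2) / len(x1_eq_2)

assert abs(p_x2_eq_2 - 0.25) < 0.01   # P(X2 = 2) = 1/4
assert abs(p_cond - 0.50) < 0.01      # P(X2 = 2 | X1 = 2) = 1/2
```

The estimates agree with the exact values 1/4 and 1/2, confirming the lack of independence.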

   The next lemma is simply the statement of the intuitive fact that the empirical distribution of M-random k-samples converges to the expected distribution of M.

Lemma 1.3.10. (Hill, 1995a) Let M be an r.p.m., and let X1, X2, . . . be a sequence of M-random k-samples for some k. Then

                             lim_{n→∞} ♯{i ≤ n : Xi ∈ B} / n = E[M(B)]             (1.3.8)

a.s. for all Borel B ⊆ R.



   Note that if we choose k = 1, take a fixed B and j ∈ N, and let

                                      Yj = I{Xj ∈ B},

then

                  lim_{n→∞} ♯{i ≤ n : Xi ∈ B} / n = lim_{n→∞} (Σ_{j=1}^{n} Yj) / n.     (1.3.9)

Given Mj, Yj is a Bernoulli variable with parameter Mj(B), so by (1.3.7)

                            E(Yj) = E(E(Yj | Mj)) = E[M(B)]                     (1.3.10)

for all j, since Mj has the same distribution as M. By Definition 1.3.9 (with k = 1) the {Yj} are independent. Since they have, by (1.3.10), identical means E[M(B)] and are uniformly bounded, it follows (Loève, 1977) that

                               lim_{n→∞} (Σ_{j=1}^{n} Yj) / n = E[M(B)]            (1.3.11)

a.s. This is basically just the Bernoulli case of the strong law of large numbers.

Remark 1.3.4. Roughly speaking, this law says that if probability distributions are

selected at random, and random samples are then taken from each of these distri-

butions in any way so that the overall process is scale (or base) neutral, then the
significant digit frequencies of the combined sample will converge to the logarithmic

distribution.

   At this point a proper definition of a random sequence in terms of the mantissa can be expressed.

Definition 1.3.10. A sequence of random variables X1, X2, . . . has scale-neutral mantissa frequency if

                     |♯{i ≤ n : Xi ∈ S} − ♯{i ≤ n : Xi ∈ sS}| / n → 0 a.s.

for all s > 0 and all S ∈ M, and has base-neutral mantissa frequency if

                   |♯{i ≤ n : Xi ∈ S} − ♯{i ≤ n : Xi ∈ S^{1/m}}| / n → 0 a.s.

for all m ∈ N and all S ∈ M.

Definition 1.3.11. An r.p.m. M is scale-unbiased if its expected distribution EM is scale invariant on (R+, M), and is base-unbiased if EM is base invariant on (R+, M). (Recall that M is a sub-σ-algebra of the Borel sets, so every Borel probability on R (such as EM) induces a unique probability on (R+, M).)

   For the main new statistical result, M(t) here denotes the random variable M(Dt), where

                                   Dt = ∪_{n=−∞}^{∞} [1, t) × 10^n

is the set of positive numbers with mantissa in [1/10, t/10). M(t) may be viewed as the random cumulative distribution function for the mantissa of the r.p.m. M.

Theorem 1.3.11. (Log-limit law for significant digits). Let M be an r.p.m. on (R+, M). The following are equivalent:

 (i) M is scale-unbiased;

(ii) M is base-unbiased and EM is atomless;

(iii) E[M(t)] = log10 t for all t ∈ [1, 10);

(iv) every M-random k-sample has scale-neutral mantissa frequency;

 (v) EM is atomless, and every M-random k-sample has base-neutral mantissa frequency;

(vi) for every M-random k-sample X1, X2, . . . ,

        ♯{i ≤ n : mantissa(Xi) ∈ [1/10, t/10)} / n → log10 t a.s. for all t ∈ [1, 10).

   The statistical log-limit significant-digit law helps justify some of the recent applications of Newcomb-Benford's Law, several of which will now be described. Remember that most of the results in this section are transcribed and commented from Hill (1996); the proof of each result can be found by referencing the corresponding lemma or theorem there.



1.4     Mean and Variance of D_b^(k)

The numerical values of the Significant Digit Law for the first digits can be computed using these expressions:

                             E(D_b^(k)) = Σ_n n Prob(D_b^(k) = n)                     (1.4.1)

                    Var(D_b^(k)) = Σ_n n² Prob(D_b^(k) = n) − E(D_b^(k))²             (1.4.2)

where the sum runs over the possible values of the k-th digit (1, . . . , b − 1 for k = 1 and 0, . . . , b − 1 for k ≥ 2). As an example, let us suppose as usual that b = 10; we already know the theoretical values for the distributions of the first and second significant digits, so some statistics for these distributions can be computed. The standard deviation is the well-known distance of the points to the mean, and the variation coefficient is the ratio between the standard deviation and the mean of the distribution.

   These are the central tendency and dispersion measures most used by researchers.



Table 1.1: Mean, Variance, Standard Deviation and Variation Coefficient for the First
and Second Significant Digit Distributions.

                        Mean     Variance   Std. Dev.   Var. Coeff.
               First    3.44024  6.05651    2.46099     0.71536
               Second   4.18739  8.25378    2.87294     0.68609
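Table 1.1 can be reproduced directly from (1.3.1), (1.3.2), (1.4.1) and (1.4.2); a short sketch (ours):

```python
import math

def stats(pmf):
    """Mean, variance, std and variation coefficient of a digit distribution."""
    mean = sum(n * p for n, p in pmf.items())
    var = sum(n * n * p for n, p in pmf.items()) - mean ** 2
    std = math.sqrt(var)
    return mean, var, std, std / mean

first = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
second = {d: sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
          for d in range(10)}

for name, pmf in (("First", first), ("Second", second)):
    mean, var, std, cv = stats(pmf)
    print(name, round(mean, 5), round(var, 5), round(std, 5), round(cv, 5))
```

The printed rows match the table above to the displayed precision.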



1.5     Simulation

Let X, as usual, be a random variable having Benford's distribution. Using (1.3.6), X can be generated via

                                    X ← ⌊10^U⌋                               (1.5.1)

where U ∼ Unif(0, 1) and ⌊·⌋ denotes the integer part of its argument. Actually, the above expression is for the first significant digit only. The interesting case is how to generate random values from each of the marginals of the generalized Newcomb-Benford distribution, for all digits and not only the first. Moreover, what if there is some bound on the maximum number N (as in elections)? What would a "Newcomb-Benford's Law under a restriction" be? How do bounds affect the generated sample?


1.5.1     Generating the r-th Significant Digit's Distribution in Base b

For this, remember that the Significant Digit Law can be stated as:

                                                   1
                                P(X = x) = log10 (1 + )                         (1.5.2)
                                                   x

for x = 10^{r−1}, 10^{r−1} + 1, . . . , 10^r − 1. Going directly to the definition of probability, the cumulative distribution can be written as:

          F_X(x) = Pr(X ≤ x)
                 = Σ_{i=10^{r−1}}^{x} log10(1 + 1/i)
                 = Σ_{i=10^{r−1}}^{x} (log10(i + 1) − log10 i)                        (1.5.3)
                 = log10(x + 1) − log10(10^{r−1})   (by telescoping)
                 = log10(x + 1) − r + 1

Hence the cumulative distribution function can be stated as:

                               F_X(x) = log10(x + 1) − r + 1.                       (1.5.4)

Note that the same derivation can be done using an arbitrary base b. In order to generate values from this distribution, suppose that u ∼ Unif(0, 1) and, as usual, set u = F_X(x). Substituting in (1.5.4) and solving for x we get:

                                         10^{u+r−1} − 1 = x

Using the floor function to get a closed form expression,

                                          X ← ⌊10^{U+r−1}⌋.                           (1.5.5)

Moreover, this can be generalized to a base b > 1, for which

                                           X ← ⌊b^{U+r−1}⌋                            (1.5.6)

where U ∼ Unif(0, 1).
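A sketch of this inverse-CDF generator (the function name `rbenford` is ours, not from the text):

```python
import math
import random

random.seed(11)

def rbenford(r, b=10):
    """Draw one value whose digits follow the r-digit Newcomb-Benford law,
    base b, via the inverse CDF X = floor(b**(U + r - 1)), U ~ Unif(0,1)."""
    return math.floor(b ** (random.random() + r - 1))

# r = 1: values 1..9 with the First Digit Law frequencies.
n = 100_000
counts = [0] * 10
for _ in range(n):
    counts[rbenford(1)] += 1
assert abs(counts[1] / n - math.log10(2)) < 0.01

# r = 2: values 10..99; the second digit is the value mod 10.
x = rbenford(2)
assert 10 <= x <= 99
```

For r = 1 this reduces exactly to (1.5.1); for r = 2 reducing mod 10 yields the second digit, as used in (1.5.7) below.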


1.5.2     Effects of Bounds on the Newcomb-Benford Generated Values

There is an open question: if there is an upper bound on the data values, what effect, if any, does this have on Newcomb-Benford's Law? For this, suppose as above that X ∼ NBenford(r, b), that is, X is a random variable distributed as a Newcomb-Benford Law for digit r and base b > 1. Using equation (1.5.5) we can generate the marginal distribution of the r-th digit by applying a modular function base b. That is,

                                 X ∼ ⌊b^{U+r−1}⌋ mod b                                (1.5.7)

with U ∼ Unif(0, 1). Note that for b = 10 and r = 2, expression (1.5.5) will generate numbers from the set {10, 11, 12, . . . , 99}. Let K be an upper bound for the experimental observations; generate

                                   X ← ⌊10^{U+2−1}⌋                                    (1.5.8)

and keep

                                Z = ⌊10^{U+1}⌋ I_{(0,K]}(Z)                            (1.5.9)

where I_{(0,K]} is the indicator function defined as

                            I_S(x) = 1, x ∈ S;
                                     0, otherwise.

   When we use r = 2 we are generating from the second digit law. There are some complications at the moment of generating numbers from 1 to 99, since in this case there are two different types of numbers: from 1 to 9, where the number of digits is one, and from 10 to 99, where the number of digits is two. Since equation (1.5.7) depends on the number of digits to simulate, there is the need to simulate proportionally from the set of numbers from 1 to 9 and the set of numbers from 10 to 99. The proportion for the first set of numbers is 1/9, and 8/9 for the second set. The trick here is to generate 1/9 of the sample size using random numbers from an N-B distribution with r = 1 and the other 8/9 of the desired sample from an N-B distribution with r = 2. This can be generalized for larger r's. The main topic in this section is to know the way that the N-B Law acts under bounds. For this, some notation is needed.

  (i) p_i^B is the Newcomb-Benford probability for the number i;

 (ii) p_i^C, under the constraint N ≤ K, is the proportion of the numbers in the set that will be sampled;

(iii) p_i^U is the proportion of the numbers in the set under no constraints.

As an observation, if there is no bound then p^C = p^U.

Example 1.5.1. Suppose that K = 52; then p_1^C = 11/52 and p_1^U = 1/9.


Definition 1.5.1. The ”Constrained N-B Law Distribution” is defined as:
                                                      j     pC
                                                 pB (Di ) pi
                                                           U
                        P (Di = x|N ≤ M) =                  i
                                                                 C             (1.5.10)
                                                     B  j pk
                                                  k p (Dk ) pU
                                                             k


   Suppose that we want a bound at N = 65; let us then compare how close the theo-
retical function 1.5.10 is to the simulation using the bound. The following figures
present different simulations with different bounds or constraints.
   The conclusion is that the agreement between the theoretical Law under constraints,
1.5.10, and the simulation is excellent. In fact, equation 1.5.10 may be considered
the "Constrained Newcomb-Benford Law". To our knowledge this is the first time it
has been introduced. Note that 1.5.10 can also be adapted for a lower bound.
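The constrained law 1.5.10 is easy to compute directly. The following is a minimal sketch in Python (a language chosen here purely for illustration; function names are ours), taking the unconstrained universe to be the integers 1 to 99 as in Example 1.5.1:

```python
import math

def benford(d):
    # Newcomb-Benford first-digit probability p^B_d
    return math.log10(1.0 + 1.0 / d)

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def constrained_benford(K, lo=1, hi=99):
    """Constrained N-B law (eq. 1.5.10) under the bound N <= K,
    the unconstrained universe being the integers lo..hi."""
    universe = range(lo, hi + 1)
    allowed = [n for n in universe if n <= K]
    # p^U_d and p^C_d: digit-d proportions without and with the bound
    pU = {d: sum(1 for n in universe if first_digit(n) == d) / (hi - lo + 1)
          for d in range(1, 10)}
    pC = {d: sum(1 for n in allowed if first_digit(n) == d) / len(allowed)
          for d in range(1, 10)}
    w = {d: benford(d) * pC[d] / pU[d] for d in range(1, 10)}
    z = sum(w.values())
    return {d: w[d] / z for d in range(1, 10)}
```

With K = 99 no number is excluded, so the constrained law collapses back to the plain N-B probabilities (as in Figure 1.1); with K = 52 the code reproduces p^C_1 = 11/52 from Example 1.5.1.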
[Figure: probability distribution over digits 1-9 comparing the simulated bound, the theoretical bound, and the NB law.]

Figure 1.1: Constrained Newcomb-Benford Law compared with a simulation restricted to K ≤ 99, for numbers between 1 and 99. Here there is effectively no restriction.



[Figure: probability distribution over digits 1-9 comparing the simulated bound, the theoretical bound, and the NB law.]

Figure 1.2: Constrained Newcomb-Benford Law compared with a simulation restricted to K ≤ 50, for numbers between 1 and 99.




[Figure: probability distribution over digits 1-9 comparing the simulated bound, the theoretical bound, and the NB law.]

Figure 1.3: Constrained Newcomb-Benford Law compared with a simulation restricted to K ≤ 20, for numbers between 1 and 99.
Chapter 2

Empirical Analysis


2.1      Introduction

The analysis of uncertainty is as old as civilization itself. There are several

interpretations of the phenomena that rule Nature in the most general case.

In modern times the basis of this theory lies in the lectures of Bernoulli, Laplace and
Thomas Bayes. Characterizing knowledge about chance and uncertainty using the measuring

tools provided by Logic is the fundamental baseline for most of the results here.

There are distinctions between Classical and Bayesian Statistics. We discuss at least the null

hypothesis probabilities and the p-value. We explore a concise analysis of the Newcomb-

Benford sequences in a Bayesian scheme, using state-of-the-art tools to

calculate how close the data is to the Law. Finally we present some examples to
illustrate how the Newcomb-Benford Law behaves as the mixture of probability

random variables becomes more complicated.







2.2      Changing P-Values into Null Hypothesis Probabilities H0

The p-value is the probability of getting values of the test statistic as extreme as, or

more extreme than, that observed if the null hypothesis is true. For a single sample

the χ² statistic is given as,

                 χ²_Statistic = Σ_{D=1}^{9} [ Prob(D^(1)_10 = D) − f_D ]² / Prob(D^(1)_10 = D)          (2.2.1)

where f_D is the observed proportion of first digits equal to D in the data. This is the basis of a classical test of

the null hypothesis that the data follow the Newcomb-Benford Law. If the
null hypothesis is accepted the data "passed" the test. If not, it opens the possibility

that the data were manipulated. As we have presented, most of the data that we intend

to analyze behave as a random mix of models. We have to specify our null,

H0, and its alternative, H1, hypothesis. In the electoral-process setting, H0 means

that there was no intervention in the data; on the other hand, H1 means that there

was intervention in the data-gathering process. It is important to measure the null hypothesis
against its evidence. In our case, if the data obey Benford's Law this implies

that there was no intervention in the electoral votes.
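The statistic in 2.2.1 is straightforward to compute. A sketch in Python (an illustrative language choice; the function name is ours), with f_D taken as the observed first-digit proportions:

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

def first_digit(x):
    n = abs(int(x))
    while n >= 10:
        n //= 10
    return n

def chi2_benford(data):
    """Chi-square statistic of eq. (2.2.1): squared deviations of the
    observed first-digit proportions f_D from the Benford probabilities,
    each scaled by the corresponding Benford probability."""
    counts = Counter(first_digit(x) for x in data)
    n = len(data)
    return sum((BENFORD[d] - counts.get(d, 0) / n) ** 2 / BENFORD[d]
               for d in range(1, 10))
```

Data whose first digits track the N-B proportions (the powers of 2, say) give a statistic near zero, while digit distributions far from the Law give a large one.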



   There is a common misunderstanding between the probability of the Null Hypothesis and the

p-value.

For a Null Hypothesis, H0, we have (Berger O. J., 2001):

                 P_val = Prob(result equal to or more extreme than the data | Null Hypothesis)



              Table 2.1: p- values in terms of Hypotheses probabilities.

                                   Pval  P (H0 |data)
                                   0.05          0.29
                                   0.01          0.11
                                   0.001      0.0184



If the p-value is small (e.g. p-value < 0.05 or less), the observation is significant.

But p-values are not null hypothesis probabilities. If P(H0) = P(H1), so that it would be a surprise for

H0 to have produced an unusual observation, then for P_val < e^{−1}:


                 P(H0|data) / P(H1|data) ≥ −e · P_val · log_e(P_val)   ⇒
                 P(H0|P_val) ≥ P(H0|data) = 1 / (1 + [−e · P_val · log_e(P_val)]^{−1})
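In code, the calibration reads as follows (a Python sketch; the function name is ours):

```python
import math

def posterior_null_lower_bound(pval):
    """Lower bound on P(H0 | data), assuming P(H0) = P(H1) = 1/2,
    obtained from the -e * p * log(p) bound on the Bayes factor;
    valid for p-values below 1/e (Berger O. J., 2001)."""
    if not 0.0 < pval < math.exp(-1):
        raise ValueError("calibration requires 0 < pval < 1/e")
    b = -math.e * pval * math.log(pval)   # bound on P(H0|data)/P(H1|data)
    return 1.0 / (1.0 + 1.0 / b)
```

This reproduces Table 2.1: a p-value of 0.05 maps to a posterior null probability of at least about 0.29.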



A full discussion of this matter can be found in (Berger O. J., 2001). It is more

natural to calculate the p-value with respect to the goodness-of-fit test of the proportions of

the observed digits versus the proportions specified by the Newcomb-Benford
Law. As we can see in Table 2.1, the correction is quite important for improving

the calculations. The table shows how much larger this lower bound is than the p-values.

So a small p-value (i.e. P_val = 0.05) implies that the posterior probability of the null

hypothesis is at least 0.29, which is not very strong evidence for rejecting a hypothesis. As

an alternative procedure we can use the BIC (Bayesian Information Criterion), or

Schwarz's criterion (Berger J.O. and Pericchi L. R., 2001), which takes the sample size
into account in explicit form:

                 log[ P(H0|data) / P(H1|data) ] ≈ log(Likelihood Ratio) + [(k1 − k0)/2] log(N)          (2.2.2)
   The likelihood ratio can be calculated from a multinomial density distribution.

In the numerator we have the proportions assigned under H0 and in the

denominator the digit proportions of the data. The evidence against the null hypothesis

can be measured using the BIC. In this case the null hypothesis represents that the

data follow an N-Benford's Law distribution.
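A sketch of this BIC comparison in Python for first-digit data, under the convention that H0 fixes the nine digit probabilities (no free parameters) while H1 leaves eight free, so k1 − k0 = 8; the multinomial likelihood under H1 is evaluated at the observed proportions, its maximum:

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1.0 / d) for d in range(1, 10)}

def first_digit(x):
    n = abs(int(x))
    while n >= 10:
        n //= 10
    return n

def bic_log_posterior_odds(data):
    """Approximate log[P(H0|data)/P(H1|data)] from eq. (2.2.2).
    Positive values favour the Benford null hypothesis."""
    counts = Counter(first_digit(x) for x in data)
    n = len(data)
    # log likelihood ratio: Benford proportions over observed (MLE) proportions
    log_lr = sum(nd * (math.log(BENFORD[d]) - math.log(nd / n))
                 for d, nd in counts.items() if nd > 0)
    k1_minus_k0 = 8
    return log_lr + 0.5 * k1_minus_k0 * math.log(n)
```

Benford-like data (e.g. powers of 2) yield positive log odds, while data concentrated on a single digit yield strongly negative ones.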


2.2.1     Posterior Probabilities with Uniform Priors

Let Υ1 be the set of integers in [1, 9] and Υ2 the set of integers included in the interval [0, 9].

The elements that may appear when the first digit is observed are members of

Υ1, and those observed in the second or any other position different from the first

are members of Υ2. In the case that the same index can be applied to the

first or any other digit, member of Υ1 or Υ2, we will refer to it as a member of Υ. Let

                 Ω = { p_1 = p_{01}, p_2 = p_{02}, . . . , p_k = p_{0k} | Σ_{i=1}^{k} p_{0i} = 1 }

with k = 9 in the case of the first digit, and k extended to 10 for the other

digits. Note that using the sets defined above we can rewrite Ω in terms of Υ as follows:

                 Ω = { p_i = p_{0i} ∀ i ∈ Υ | Σ_{i∈Υ} p_{0i} = 1 }.

Then our hypotheses can be written as:

                 H0 : Ω
                 H1 : Ω^c          (2.2.3)

where Ω^c denotes the complement of Ω; in other words,

                 Ω^c = { p_i ≠ p_{0i} for some i ∈ Υ }.
Assume a uniform prior for the values of the p_i's; then


                 Π_u(p_1, p_2, . . . , p_k) = Γ(k) = (k − 1)!          (2.2.4)

We can write the posterior probability of H0 in terms of the Bayes Factor. Let x be

the data vector; by the definition of the Bayes Factor we have:

                 B01 = [ P(H0|x) P(H1) ] / [ P(H1|x) P(H0) ]          (2.2.5)

If we have nested models and P(H0) = P(H1) = 1/2, then the Bayes Factor reduces to

                 B01 = P(H0|x) / P(H1|x)
                     = P(H0|x) / (1 − P(H0|x))
                     = 1 / ( P(H0|x)^{−1} − 1 )          (2.2.6)

                 ⇔ 1/P(H0|x) − 1 = 1/B01
                 ⇔ 1/P(H0|x) = 1/B01 + 1
                 ⇔ 1/P(H0|x) = (B01 + 1)/B01

therefore

                 P(H0|x) = B01 / (B01 + 1)          (2.2.7)
                                                          B01 + 1
For the data vector, let n = (n_1, n_2, . . . , n_k), where n_i is the number of times digit i appears

in the significant-digit position of interest across the data. Recall that if we observe

the first digit then i ∈ Υ1, while for the second and onwards i ∈ Υ2, or, more generally,

by the convention above, i ∈ Υ in either case. Applying the definition to
problem 2.2.3, we have

                 B01 = f(n_1, n_2, n_3, . . . , n_k | Ω) / ∫_{Ω^c} f(n_1, n_2, n_3, . . . , n_k | Ω^c) Π_U(p_1, p_2, p_3, . . . , p_k) dp_1 dp_2 dp_3 . . . dp_{k−1}
with Σ_{i∈Υ} p_i = 1 and p_i ≥ 0 ∀ i ∈ Υ. Substituting into our problem,

                 B01 = [ (n! / Π_{i=1}^{k} n_i!) Π_{i=1}^{k} p_{0i}^{n_i} ] / [ (k − 1)! ∫ (n! / Π_{i=1}^{k} n_i!) Π_{i=1}^{k} p_i^{(n_i+1)−1} dp_i ]

   Cancelling the factorial terms and using the following identity:

                 ∫ Π_{i=1}^{k} p_i^{(n_i+1)−1} dp_i = Π_{i=1}^{k} Γ(n_i + 1) / Γ(n + k)

leads to a simplified expression for B01:

                 B01 = [ p_{01}^{n_1} p_{02}^{n_2} · · · p_{0k}^{n_k} ] / [ (k − 1)! Π_{i=1}^{k} Γ(n_i + 1) / Γ(n + k) ]          (2.2.8)

Since we already know how to get the posterior probability from the Bayes Factor (using

2.2.7), substituting B01 we have:

                 P(H0|x) = B01 / (B01 + 1),  with  B01 = [ p_{01}^{n_1} p_{02}^{n_2} · · · p_{0k}^{n_k} ] / [ (k − 1)! Π_{i=1}^{k} Γ(n_i + 1) / Γ(n + k) ]          (2.2.9)
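Equation 2.2.9 can be evaluated numerically; computing the Bayes factor 2.2.8 on the log scale avoids overflow in the Gamma functions. A Python sketch (the function name is ours):

```python
import math

def posterior_h0(counts, p0):
    """P(H0 | x) of eq. (2.2.9): counts[i] is n_i, the number of
    observations of digit i, and p0[i] the null (e.g. Benford)
    probability of that digit; uniform prior on the alternative."""
    k = len(p0)
    n = sum(counts)
    # log of the Bayes factor in eq. (2.2.8)
    log_b01 = sum(ni * math.log(pi) for ni, pi in zip(counts, p0))
    log_b01 -= (math.lgamma(k)                        # log (k-1)!
                + sum(math.lgamma(ni + 1) for ni in counts)
                - math.lgamma(n + k))
    # P(H0|x) = B01 / (B01 + 1), computed stably from log B01
    if log_b01 > 0:
        return 1.0 / (1.0 + math.exp(-log_b01))
    b01 = math.exp(log_b01)
    return b01 / (b01 + 1.0)
```

Counts close to the Benford proportions give a posterior near 1; a flat digit distribution of the same size gives a posterior near 0.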


   There are different ways to calculate the probability of the null hypothesis given

certain data; each one depends on the prior knowledge and the type of Bayes

Factor, or approximation, in use (i.e. P(Frac) is based on the Fractional Bayes

Factor (Berger J.O. and Pericchi L. R., 2001)):

                 BF^{FRAC}_{01} = [ f_0(data|p_0) / ∫_Ω f_1(data|p) π^N(p) dp ] · [ ∫_Ω f_1^{r/n}(data|p) π^N(p) dp / f_0^{r/n}(data|p_0) ]          (2.2.10)

where p_0 is given by the Newcomb-Benford Law and r is the number of adjustable

parameters minus one, that is r = 8 or r = 9, for the first and second digit respectively.

P(Approx) is based on the following approximation of the Bayes factor:

                 BF^{Approx}_{01} = ( f_0(data|p_0) / f_1(data|p̂) )^{1 − r/n} (n/r)^{r/2}          (2.2.11)
                                                        ˆ        r
where p̂ is the maximum likelihood estimator of p. The GBIC is based on a still

unpublished proposal by (Berger J.O., 1991), itself based on the prior in (Berger

J.O., 1985).



2.3       Multinomial Model Proposal

In the following, let n_i be the count of digit i, with i ∈ Υ as usual. This can be thought of (Ley, 1996) as

a random variable N following a multinomial distribution with parameter vector
θ; thus

                 f(N|θ) = [ (Σ_{j∈Υ} n_j)! / Π_{j∈Υ} n_j! ] Π_{j∈Υ} θ_j^{n_j}          (2.3.1)

As usual we will assume a uniform prior for θ, with mean 1/k for each θ_j, where

k is the cardinality of the set Υ; that means k = |Υ1| = 9 if we are working with the first digit,

and k = |Υ2| = 10 if the observed significant digit is the second or later. The natural conjugate prior is a Dirichlet

density. This distribution has the following general form:

                 Di_k(θ|α) = c (1 − Σ_{l=1}^{k} θ_l)^{α_{k+1}−1} Π_{l=1}^{k} θ_l^{α_l−1}          (2.3.2)

where

                 c = Γ( Σ_{l=1}^{k+1} α_l ) / Π_{l=1}^{k+1} Γ(α_l)

and α = (α_1, α_2, . . . , α_{k+1}) such that every α_l > 0, and θ = (θ_1, θ_2, . . . , θ_k) with

0 < θ_l < 1 and Σ_{l=1}^{k} θ_l ≤ 1. For simplicity we will use each α_i = α; thus

                 g(θ) = [ Γ(kα) / Γ(α)^k ] Π_{j∈Υ} θ_j^{α−1}          (2.3.3)
                                             Γ(α)k j∈Υ j
The posterior distribution of θ is given by a Dirichlet with parameters {α + n_1, α +

n_2, . . . , α + n_k}. Then we have that

                 h(θ|x) = [ Γ(kα + Σ_{j∈Υ} n_j) / Π_{j∈Υ} Γ(α + n_j) ] Π_{j∈Υ} θ_j^{α+n_j−1}          (2.3.4)
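The conjugate update is a one-liner, and posterior draws can be generated with normalized Gamma variates. A Python sketch (function names are ours):

```python
import random

def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean of theta under the symmetric Dirichlet(alpha)
    prior: by eq. (2.3.4) the posterior is Dirichlet(alpha + n_j)."""
    k, n = len(counts), sum(counts)
    return [(alpha + nj) / (k * alpha + n) for nj in counts]

def dirichlet_posterior_draw(counts, alpha=1.0, rng=random):
    """One draw from the posterior Dirichlet, via normalized Gammas."""
    g = [rng.gammavariate(alpha + nj, 1.0) for nj in counts]
    s = sum(g)
    return [x / s for x in g]
```

Marginal posterior intervals, such as those displayed in Figure 2.1, can be estimated by repeating dirichlet_posterior_draw many times and taking quantiles per coordinate.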




2.4      Examples

Our aim now is to show empirically how effective the reasoning 1.3.1 given

by (Hill, 1996) can be. Our first examples use distribution functions from the

exponential family. Most applications that involve a multilevel analysis are called
hierarchical models. This type of model allows a more "objective" approach to

inference by estimating the parameters of prior distributions from data rather than

requiring them to be specified using subjective information (Gelman A., 1995; Carlin

Bradley P., 2000).

Example 2.4.1. The simplest model that we present here is a Poisson model with a

fixed parameter λ. For this first case 500 values are simulated with λ = 100. Here

P(H0|data) = 0, which indicates how poor this model is at simulating a Benford process.
As Hill stated, and as we discussed in earlier chapters, the NB Law can be satisfied

if there is a random mixture of mixture distributions. Figure 2.2 shows how

poorly the first-digit frequencies of the simulated values match the N-B

Law for the first digit.

   Remember that this model is the simplest one; it does not have a hierarchical structure.
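The failure is easy to reproduce. A Python sketch (using a rounded normal approximation to sample Poisson(100), which is accurate enough for this illustration; the seed is arbitrary):

```python
import math
import random
from collections import Counter

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def poisson(lam, rng):
    """Poisson sampler: Knuth's method for small lambda, a rounded
    normal approximation for large lambda."""
    if lam > 50:
        return max(0, int(round(rng.gauss(lam, math.sqrt(lam)))))
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(1)
sample = [poisson(100, rng) for _ in range(500)]
digits = Counter(first_digit(x) for x in sample)
# Values cluster near 100, so virtually all first digits are 1, 8 or 9:
# the Benford proportions for digits 2-7 cannot be matched.
```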

Example 2.4.2. The following is a simple hierarchical model with two stages, where some
of the parameters are fixed. It is a model frequently used in actuarial sciences and quality
[Figure: marginal posterior boxplots of Newcomb-Benford proportions; (a) first digit boxplot, (b) second digit boxplot.]

Figure 2.1: Posterior intervals for the first and second digit, shown using symmetric boxplots.

control.

                 n ∼ Pois(λν)
                 λ ∼ G(α, β)          (2.4.1)

The probability distribution is given by

                 Pg(n|α, β, ν) = ∫_0^∞ Pois(n|λν) G(λ|α, β) dλ

The resulting expression is known as the generalized negative binomial distribution,
Nb(n|α, β/(β + ν)):

                 Pg(n|α, β, ν) = ∫_0^∞ [ e^{−λν}(λν)^n / Γ(n + 1) ] · [ β^α λ^{α−1} e^{−βλ} / Γ(α) ] dλ
                               = [ β^α ν^n / (Γ(α)Γ(n + 1)) ] ∫_0^∞ λ^{n+α−1} e^{−(β+ν)λ} dλ
                               = [ Γ(n + α) / (Γ(n + 1)Γ(α)) ] ( β/(β + ν) )^α ( ν/(β + ν) )^n

   First we treat the Gamma part of the model above as a mixture of different

distributions of the parameter λ in the Poisson distribution function.

The set of values for the parameter λ will be {10, 20, 30, 50, 70}.

Each vector of the overall simulated data corresponds to the Poisson model, with

partitions of length 50. Carrying out the Benford analysis we get P(H0|data) = 0.878719187.

We see that for this small example of mixtures the Newcomb-Benford
Law works. Note that Figure 2.3 shows how close the real Law is to

the simulated values.
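A sketch of this discrete mixture in Python, reusing a simple Poisson sampler; the seed is arbitrary:

```python
import math
import random
from collections import Counter

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def poisson(lam, rng):
    if lam > 50:   # rounded normal approximation for large lambda
        return max(0, int(round(rng.gauss(lam, math.sqrt(lam)))))
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(2)
lambdas = [10, 20, 30, 50, 70]
# five partitions of length 50, one per lambda: 250 values in all
mixed = [poisson(lam, rng) for lam in lambdas for _ in range(50)]
digits = Counter(first_digit(x) for x in mixed if x > 0)
# Unlike the single-lambda model, the mixture spreads first digits
# across most of 1..9, moving the proportions toward the N-B Law.
```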




   In Model 2.4.1, instead of using the discrete version for the λ distribution, here

we simulate using a uniform prior on the parameters of the Gamma distribution

function. The model implemented in this example goes as follows.

                 n ∼ Pois(λν)
                 λ ∼ G(α, β)          (2.4.2)
                 α, β ∼ Unif(1, 500)

This simulation is an extension of model 2.4.1. In general this is a Negative

Binomial family of distributions; indeed it is a mixture of distributions itself. In Figure 3

we can see the histogram of the cumulative distribution (a) and the proportions

of the significant digits together with the N-B first-digit law proportions. Here the

probability of the null hypothesis given the data is 1. Table 2.2 shows a summary of
the overall results.
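A sketch of model 2.4.2 in Python; the exposure ν = 1 and the seed are illustrative assumptions (ν is not fixed above), and Python's gammavariate takes a scale, so the rate β enters as 1/β:

```python
import math
import random
from collections import Counter

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def poisson(lam, rng):
    if lam > 50:   # rounded normal approximation for large lambda
        return max(0, int(round(rng.gauss(lam, math.sqrt(lam)))))
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(3)
nu = 1.0   # assumed exposure

def draw():
    a = rng.uniform(1, 500)              # alpha ~ Unif(1, 500)
    b = rng.uniform(1, 500)              # beta  ~ Unif(1, 500)
    lam = rng.gammavariate(a, 1.0 / b)   # lambda ~ Gamma(alpha, rate beta)
    return poisson(lam * nu, rng)

sample = [draw() for _ in range(500)]
digits = Counter(first_digit(x) for x in sample if x > 0)
# lambda = alpha/beta spans several orders of magnitude, so the first
# digits of the positive draws spread out in a Benford-like way.
```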

Example 2.4.3. The Multinomial Model is a rich source of mixtures: observing an electoral

process, you can see different parameters for the probability

values of each candidate per region in a country. As a little experiment, suppose that

you have two candidates and that some of the persons in an electoral college of a particular

country do not want to vote; then for that particular region you will have a parameter
vector p = [p_1, p_2, p_3] with p_1 + p_2 < 1 and p_3 = 1 − p_1 − p_2. Recall that p_3 is the

probability that a person votes for neither of the candidates. For this particular

simulation 1000 electoral colleges are simulated in 10 regions. As we said, there

are two candidates. The joint density function of all the data is presented in Figure 4.

The P(H0|data) = 1 for 29058 simulated data points. A summary of the examples is

presented in Table 2.2.
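A sketch of such a simulation in Python; the college size (400 voters) and the per-region probability ranges are illustrative assumptions not specified above:

```python
import random
from collections import Counter

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def multinomial_votes(n, p1, p2, rng):
    """Vote counts (c1, c2) from one multinomial draw with
    probabilities [p1, p2, 1 - p1 - p2] over n voters."""
    c1 = c2 = 0
    for _ in range(n):
        u = rng.random()
        if u < p1:
            c1 += 1
        elif u < p1 + p2:
            c2 += 1
    return c1, c2

rng = random.Random(4)
VOTERS = 400                      # assumed college size
votes = []
for _ in range(10):               # 10 regions, each with its own p
    p1 = rng.uniform(0.05, 0.60)
    p2 = rng.uniform(0.05, min(0.60, 0.90 - p1))
    for _ in range(100):          # 100 electoral colleges per region
        c1, c2 = multinomial_votes(VOTERS, p1, p2, rng)
        votes += [c1, c2]

digits = Counter(first_digit(x) for x in votes if x > 0)
# The region-to-region variation in (p1, p2) is the mixture that pushes
# the first digits of the vote counts toward the N-B Law.
```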



             Table 2.2: Summary of the results of the above examples.

      Example                Simulated length of data P (H0 |data) p − value
      Poisson Model                               500            0         0
      Pois-Gamma Discrete                         250        0.991     0.008
      Neg - Binomial                              500        0.989     0.001
      Multinomial                              29058         0.999     0.002



2.5      Conclusions of the examples

Note that as the hierarchy in each model becomes more complicated, a closer approach to the

N-B Law frequencies in the first digit is easily found. The more complicated the

model, the closer the approach to the N-B Law. The restrictions on the parameters affect

the statistical closeness to the Benford Law.
[Figure: histogram of the simple Poisson model with λ = 100, and the first-digit proportions of the simulation against the first-digit N-B Law.]

 Figure 2.2: Newcomb-Benford Law theoretical frequencies for the first significant digit, compared with the simple Poisson model simulation.



[Figure: histogram of the Poisson model partitioned according to the different λ parameters, and the first-digit proportions of the discrete Gamma-Poisson simulation against the first-digit N-B Law.]

Figure 2.3: Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the simulation results of the discrete Gamma-Poisson example.
NEWCOMB-BENFORD’S LAW APPLICATIONS TO ELECTORAL PROCESSES, BIOINFORMATICS, AND THE STOCK INDEX
  • 3. UNIVERSITY OF PUERTO RICO. Date: May 2006. Author: David A. Torres Núñez. Title: Newcomb-Benford’s Law Applications to Electoral Processes, Bioinformatics, and the Stock Index. Department: Mathematics. Degree: M.Sc. Convocation: May. Year: 2006. Permission is herewith granted to the University of Puerto Rico to circulate and to have copied for non-commercial purposes, at its discretion, the above title upon the request of individuals or institutions. Signature of Author. THE AUTHOR RESERVES OTHER PUBLICATION RIGHTS, AND NEITHER THE THESIS NOR EXTENSIVE EXTRACTS FROM IT MAY BE PRINTED OR OTHERWISE REPRODUCED WITHOUT THE AUTHOR’S WRITTEN PERMISSION. THE AUTHOR ATTESTS THAT PERMISSION HAS BEEN OBTAINED FOR THE USE OF ANY COPYRIGHTED MATERIAL APPEARING IN THIS THESIS (OTHER THAN BRIEF EXCERPTS REQUIRING ONLY PROPER ACKNOWLEDGEMENT IN SCHOLARLY WRITING) AND THAT ALL SUCH USE IS CLEARLY ACKNOWLEDGED.
  • 4. To my family, and the extended family that always kept faith in me.
  • 5. Table of Contents
    Table of Contents  v
    List of Tables  vii
    List of Figures  ix
    Abstract  i
    Acknowledgements  ii
    Introduction  1
    1 Basic Notation and Derivations  4
      1.1 Introduction  4
      1.2 Derivations  5
        1.2.1 A Differential Equation Approach  5
        1.2.2 The Float Point Notation Scheme. Knuth  7
        1.2.3 In the Float Point Notation Scheme. Hamming  8
        1.2.4 The Brownian Model Scheme. Pietronero  10
      1.3 A Statistical Derivation of the N-B Law  11
        1.3.1 Mantissa  12
        1.3.2 A Natural Probability Space  15
        1.3.3 Mantissa σ-algebra Properties  15
        1.3.4 Scale and Base Invariance  17
      1.4 Mean and Variance of the D_b^k  23
      1.5 Simulation  24
        1.5.1 Generating the r-th Significant Digit's Distribution in Base b  24
        1.5.2 Effects of Bounds on the Newcomb-Benford Generated Values  25
  • 6. (Table of Contents, continued)
    2 Empirical Analysis  30
      2.1 Introduction  30
      2.2 Changing P-Values in Null Hypothesis Probabilities H0  31
        2.2.1 Posterior Probabilities with Uniform Priors  33
      2.3 Multinomial Model Proposal  36
      2.4 Examples  37
      2.5 Conclusions of the Examples  41
    3 Stock Indexes' Digits  44
      3.1 Introduction  44
      3.2 Statistical Analysis  44
      3.3 Results  46
    4 On Image Analysis in the Microarray Intensity Spot  49
      4.1 Introduction  49
      4.2 Experiment  50
        4.2.1 Microarray Measurements and Image Processing  52
      4.3 Results  53
    5 Electoral Process in a Newcomb-Benford Law Context  57
      5.1 Introduction  57
      5.2 General Democratic Election Model  58
      5.3 Empirical Data  59
      5.4 Conclusions  72
    6 Appendix: MATLAB PROGRAMS  75
      6.1 Matlab Codes  75
  • 7. List of Tables
    1 Newcomb-Benford Law for the First Significant Digit  2
    1.1 Mean, Variance, Standard Deviation and Variation Coefficient for the First and Second Significant Digit Distributions  24
    2.1 p-values in terms of hypothesis probabilities  32
    2.2 Summary of the results of the above examples  41
    3.1 N-Benford's for 1st and 2nd digit: p-values, probability null bound and approximate probability for the different increments  47
    3.2 N-Benford's for 1st and 2nd digit: the probability of the null hypothesis given the data and the length of the data  47
    4.1 N-Benford's for 1st and 2nd digit: P(H0|data), P(Approx), P(Frac) and Pr(BIC)  54
    4.2 N-Benford's for 1st and 2nd digit: the number of observations and p-values  55
    5.1 The second digit proportions analysis of the winner for the set of historical elections  59
    5.2 The second digit proportions analysis of the loser for the set of historical elections  60
    5.3 The first digit proportions of the distance between the winner and the loser for the set of historical elections  60
  • 8. (List of Tables, continued)
    5.4 The second digit proportions of the distance between the winner and the loser for the set of historical elections  61
    5.5 The second digit proportions of the sum of the winner and the loser for the set of historical elections  61
    5.6 The Newcomb-Benford's for 1st and 2nd digit for the United States of North America Presidential Elections 2004. Note how close the values of the posterior probability given the data are to 1.0  62
    5.7 The second digit proportions analysis of the winner for the set of historical elections. Number of observed values, p-value and probability null bound are shown  62
    5.8 The second digit proportions analysis of the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid  63
    5.9 The first digit proportions of the distance between the winner and the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid  63
    5.10 The second digit proportions of the distance between the winner and the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid  64
    5.11 The second digit proportions of the sum of the winner and the loser for the set of historical elections. Number of observed values, p-value and probability null bound are shown. Note that p-values should be smaller than 1/e for the bound to be valid  64
  • 9. List of Figures
    1 Newcomb-Benford Law theoretical frequencies for the first and second significant digit  3
    1.1 Constrained Newcomb-Benford Law compared with a restricted bound with digits K ≤ 99, from numbers between 1 and 99. Here there is no restriction  28
    1.2 Constrained Newcomb-Benford Law compared with a restricted bound with digits K ≤ 50, from numbers between 1 and 99  28
    1.3 Constrained Newcomb-Benford Law compared with a restricted bound with digits K ≤ 20, from numbers between 1 and 99  29
    2.1 Presenting the posterior intervals for the first and second digit using symmetric boxplots  38
    2.2 Newcomb-Benford Law theoretical frequencies for the first significant digit  42
    2.3 Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the Example 1 simulation results  42
    2.4 Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the Example 2 simulation results  43
    2.5 Newcomb-Benford Law theoretical frequencies for the first significant digit. This represents the multinomial example simulation results  43
    4.1 Histograms of the Intensities and the Adjustments  55
    4.2 N-Benford's Law compared with Intensity Microarray Spots Without Adjustment  56
  • 10. (List of Figures, continued)
    4.3 N-Benford's Law compared with Intensity Microarray Spots With Adjustment  56
    5.1 Presidential election analysis using electoral college votes compared with the N-B Law for the 1st digit  65
    5.2 Presidential election analysis using electoral college votes compared with the N-B Law for the 2nd digit  66
    5.3 Puerto Rico 1996 Elections compared with the Newcomb-Benford Law for the second digit  67
    5.4 Puerto Rico 2000 Elections compared with the Newcomb-Benford Law for the second digit  68
    5.5 Puerto Rico 2004 Elections compared with the Newcomb-Benford Law for the first digit  69
    5.6 Venezuela Revocatory Referendum manual votes proportions compared with the Newcomb-Benford Law's proportions for the second digit  70
    5.7 Venezuela Revocatory Referendum manual votes proportions compared with the Newcomb-Benford Law's proportions for the second digit  71
    5.8 Venezuela Revocatory Referendum electronic and manual votes proportions compared with the Newcomb-Benford Law's second digit proportions  73
    5.9 Venezuela Revocatory Referendum manual distance between the winner and loser proportions compared with the Newcomb-Benford Law's proportions  74
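Several of the simulation figures listed above compare empirical first-digit frequencies against the theoretical law. A minimal version of such a simulation can be sketched as follows (in Python for illustration; the thesis's own programs are MATLAB). It exploits the fact that if U is uniform on [0, 1), then X = 10**U has log10(X) uniform, which is exactly the uniform-mantissa condition behind the law:

```python
import math
import random

def simulate_first_digits(n, seed=1):
    """Sample n values X = 10**U with U ~ Uniform[0, 1) and tally first digits.

    Because log10(X) is uniform on [0, 1), the first significant digit of X
    follows the Newcomb-Benford law exactly.
    """
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(n):
        x = 10 ** rng.random()   # x lies in [1, 10); its first digit is int(x)
        counts[int(x)] += 1
    return [c / n for c in counts]

freqs = simulate_first_digits(100_000)
theory = [0.0] + [math.log10(1 + 1 / d) for d in range(1, 10)]
for d in range(1, 10):
    print(d, round(freqs[d], 4), round(theory[d], 4))
```

With 100,000 draws the empirical proportions agree with the theoretical ones to roughly two decimal places, which is the kind of agreement the unconstrained figures display; the constrained figures (K ≤ 50, K ≤ 20) arise when the sampled values are additionally bounded.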
• 11. Abstract Since this rather amazing fact was discovered in 1881 by the American astronomer Newcomb (1881), many scientists have been searching for members of the outlaw numbers family. Newcomb noticed that the pages of the logarithm books containing numbers starting with 1 were much more worn than the other pages. After analyzing several sets of naturally occurring data, Newcomb went on to derive what later became Benford's law. As a tribute to the figure of Newcomb, we call this phenomenon the Newcomb-Benford's Law. We start by establishing a connection between the Microarray and Stock Index data sets, which can be seen as an extension of the work done by Hoyle David C. (2002) and Ley (1996). Most of the analysis has been made using Classical and Bayesian statistics. We explain the differences between the different approaches to hypothesis testing between models, Berger J.O. and Pericchi L. R. (2001). Finally, we apply these concepts to the different types of data, including Microarray, Stock Index and Electoral Process data. There are several results on constrained data; the most relevant are the Constrained Newcomb Benford Law and most of the Bayesian Analysis covered here, applied to this problem. i
• 12. Acknowledgements I wish to express my gratitude to everyone who contributed to making this work possible. I would like to thank God first, and also Dr. L. R. Pericchi, my supervisor, for his many suggestions and constant support during this research. I am also thankful to the whole faculty of Mathematics for their guidance through the early years of chaos and confusion. Doctor Pericchi expressed his interest in my work and supplied me with the preprints of some of his recent joint work with Berger J. O., which gave me a better perspective on the results. L. R. Pericchi, thank you for being more than a supervisor: a father and my friend. I am indebted to Dr. María Eglée Pérez, Prof. Z. Rodriguez, and Humberto Ortiz Zuazaga for providing data and insights during the drafting process. I would also like to thank my parents for providing me with the opportunity to be where I am. Without them, none of this would even be possible. You have always been my biggest fans and I appreciate that. To my father: thanks for the support, even if you are not here anymore. To my mother: you are my hero, always. I would also like to thank my special friends, because you have been my biggest critics throughout my entire personal life and professional career. Your encouragement, input and constructive criticism have been priceless. For that, thanks to Ricardo Ortiz, Ariel Cintrón, Antonio Gonzales, Tania Yuisa Arroyo, Erika Migliza, Wally Rivera, Raquel Torres, Dr. Pedro Rodriguez Esquerdo, Dr. Punchin, Lourdes Vazquez (sister), Chungseng Yang (brother), Lihai Song and all the extended family. I would like to thank my soulmate, Ana Tereza Rodriguez, for keeping me grounded and for providing me with some memorable experiences. Rio Piedras, Puerto Rico David Torres Núñez May 15, 2006 ii
• 13. Introduction The first known person to explain the anomalous distribution of the digits was the astronomer and mathematician Simon Newcomb, in the American Journal of Mathematics. He stated: "The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable." Since then many mathematicians have been enthusiastic about finding sets of data suited to this phenomenon. There has been a century of theorems, definitions, conjectures and discoveries around the first digit phenomenon. The discovery of this fact goes back to 1881, when the American astronomer Simon Newcomb noticed that the first pages of logarithm books (used at that time to perform calculations), the ones containing numbers that started with 1, were much more worn than the other pages. However, it has been argued that any book that is used from the beginning would show more wear and tear on the earlier pages. This story might thus be apocryphal, just like Isaac Newton's supposed discovery of gravity from observation of a falling apple. The phenomenon was rediscovered in 1938 by the physicist Benford (1938), who checked it on a wide variety of data sets and was credited for it. In 1996, Hill (1996) proved the result about random mixtures of distributions of random variables, and 1
• 14. 2 Table 1: Newcomb Benford Law for the First Significant Digit. Digit: 1, 2, 3, 4, 5, 6, 7, 8, 9. Probability: .301, .176, .125, .097, .079, .067, .058, .051, .046. generalizes the Law for dimensionless quantities. Some mathematical series and other data sets that satisfy Newcomb Benford's Law are: • Prime numbers • Series distributions • Fibonacci series • Factorial numbers • Sequences of powers • Numbers in Pascal's triangle • Demographic statistics • Other social science data • Numbers that appear in magazines and newspapers. Intuitively, most people assume that in a string of numbers sampled randomly from some body of data, the first non-zero digit could be any number from 1 through 9, with all nine digits regarded as equally probable. As the figure below shows, the equal-probability assumption for the digits is very different from the Newcomb-Benford distribution. As an example, for the first digit we have the following discrete probability distribution function.
• 15. 3 [Two panels: "Newcomb-Benford's Law First Digit" (1st Digit Law vs. y = 1/9) and "Newcomb-Benford's Law Second Digit" (2nd Digit Law vs. y = 1/10); probability plotted against the digits.] Figure 1: Newcomb-Benford Law theoretical frequencies for the first and second significant digit. We will present two different situations with different derivations. The first covers data with units, like dollars or meters. The second involves unit-free data, like counts of votes. Applications presented here include: 1. Puerto Rico's Stock Index. 2. Microarray data in Bioinformatics. 3. Voting counts in Venezuela, the United States and Puerto Rico. The potential uses are detection of fraud, detection of corruption of data, or detection of a lack of proper scaling. We analyze the statistical properties of the Benford distribution and illustrate Benford's law with many data sets of both theoretical and real-life origins. Applications of Benford's Law, such as detecting fraud, are summarized and explored to reveal the power of this mathematical principle. Most of this work has been typeset in LaTeX, Lamport (1986) and Knuth (1984).
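The theoretical frequencies plotted in Figure 1 follow directly from the logarithmic law. A minimal sketch (the function names are ours, not from the thesis):

```python
import math

def benford_first_digit(d):
    # P(D1 = d) = log10(1 + 1/d), for d = 1, ..., 9
    return math.log10(1 + 1 / d)

def benford_second_digit(d):
    # P(D2 = d) = sum over k = 1..9 of log10(1 + 1/(10k + d)), for d = 0, ..., 9
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

first = [benford_first_digit(d) for d in range(1, 10)]
second = [benford_second_digit(d) for d in range(10)]
print([round(p, 3) for p in first])   # matches Table 1: .301, .176, ..., .046
```

Both probability vectors sum to one, and the second-digit law is visibly flatter than the first-digit law, exactly as the two panels of Figure 1 show.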
• 16. Chapter 1 Basic Notation and Derivations 1.1 Introduction The data sets of the family of outlaw numbers come from two different kinds of classification. The first type consists of numbers that have units, like money. The other type consists of numbers that do not have units, like votes. This last type of data set can be found in Electoral Processes and in mathematical series. In this chapter we introduce some basic concepts and notation consistent with Hill (1996). This formulation helps to understand, in a deeper way, the probabilistic basis of the Newcomb Benford Law. There are slightly different derivations; most of them are not as statistically general as the one presented by Hill. Another example in this discussion is base invariance, similar to the one presented by L. Pietronero (2001). The aim of this work is to generalize the Newcomb Benford Law in order to apply it to wider classes of data sets, and to verify its fit to different sets of data with modern Bayesian Statistical methods. 4
• 17. 5 1.2 Derivations We present here some derivations; first of all we will use a heuristic approach based on invariance. In this first section, Benford's law applies to data that are not dimensionless, so the numerical values of the data depend on the units. 1.2.1 A Differential Equation Approach. If there exists a universal probability distribution P(x) over such numbers, then it must be invariant under a change of scale Hill (1995a), so P(kx) = f(k)P(x). (1.2.1) Integrating with respect to x we have ∫P(kx)dx = f(k)∫P(x)dx. Since ∫P(x)dx = 1, this gives ∫P(kx)dx = f(k). On the other hand, the substitution y = kx, dy = k dx yields ∫P(kx)dx = (1/k)∫P(y)dy = 1/k, and hence f(k) = 1/k.
• 18. 6 Taking derivatives with respect to k, ∂P(kx)/∂k = xP′(kx) = P(x)f′(k) = −P(x)/k². Setting k = 1 gives xP′(x) = −P(x). (1.2.2) This equation has solution P(x) = 1/x; indeed, with P(x) = 1/x we get xP′(x) = x(−1/x²) = −1/x = −P(x). Although this is not a proper probability distribution (since its integral diverges), both the laws of physics and human convention impose cutoffs. If many powers of 10 lie between the cutoffs, then the probability that the first (decimal) digit is D is given by the logarithmic distribution
• 19. 7 P_D = ∫_D^{D+1} P(x)dx / ∫_1^{10} P(x)dx = ∫_D^{D+1} (1/x)dx / ∫_1^{10} (1/x)dx = [log10 x]_D^{D+1} / [log10 x]_1^{10} = log10(1 + 1/D). The last expression is called the Newcomb-Benford's Law (NBL). However, Benford's law applies not only to scale-invariant data, but also to numbers chosen from a variety of different sources. Explaining this fact requires a more rigorous investigation of central-limit-like theorems for the mantissas of random variables under multiplication. As the number of variables increases, the density function approaches that of a logarithmic distribution. Hill (1996) rigorously demonstrated that the "mixture of distributions" given by random samples taken from a variety of different distributions is, in fact, Newcomb Benford's law. Here we present those results that explain the properties of the NBL. 1.2.2 The Floating Point Notation Scheme. Knuth There are conditions on the leading digit, Knuth (1981). He noticed that, in order to account for the leading digit's law, it is important to observe the way numbers are written in floating point notation. As suggested there, the leading digit of u is determined by log(u) mod 1. The operator r mod 1 represents the fractional part of the number r, and fu is the normalized fraction part of u. Let u be a non-negative number. Note
• 20. 8 that the leading digit of u is less than d if and only if (log10 u) mod 1 < log10 d (1.2.3) since 10^{fu} = 10^{(log10 u) mod 1}. Taking a random number W from a random distribution that may occur in Nature, following Knuth, we may expect that (log10 W) mod 1 ∼ Unif(0, 1), at least to a very good approximation. Similarly, it is expected that any transformation of W will be distributed in the same manner. Therefore, by (1.2.3), the leading digit will be 1 with probability log10(1 + 1/1) ≈ 30.103%; it will be 2 with probability log10 3 − log10 2 ≈ 17.609%; and in general, if r is any real value in [1, 10), we ought to have 10^{fu} ≤ r approximately log10 r of the time. This gives a rough picture of why the leading digits behave the way they do. 1.2.3 The Floating Point Notation Scheme. Hamming Another approach was suggested by Hamming (1970). Let p(r) be the probability that 10^{fU} ≤ r, where r lies between 1 and 10 (1 ≤ r ≤ 10) and fU is the normalized fraction part of a random normalized floating point number U. Since this distribution should be invariant under a change of scale, suppose that every constant of our universe of random floating point numbers is multiplied by a constant factor c; this will not affect p(r). When we multiply, there is a transformation from (log10 U) mod 1 to (log10 U + log10 c) mod 1. Let Pr(·) be the usual probability function. Then, by definition, p(r) = Pr((log10 U) mod 1 ≤ (log10 r) mod 1). On the assumptions of (1.2.3), it follows that
• 21. 9 p(r) = Pr((log10 U − log10 c) mod 1 ≤ log10 r). Splitting according to whether the shift wraps around the unit interval, p(r) = p(r/c) + 1 − p(10/c), if c ≤ r; and p(r) = p(10r/c) − p(10/c), if c > r. Until now the values of r are included in the closed interval [1, 10]. To be methodical it is important to extend the values of r to values outside the mentioned interval. For this we define p(10^n r) = p(r) + n for a positive integer n. With this extension, if we replace c by d, the invariance relation can be written as: p(rd) = p(r) + p(d). (1.2.4) Under the hypothesis of invariance of the distribution under constant multiplication, (1.2.4) will be true for all r > 0 and d ∈ [1, 10]. Since p(1) = 0 and p(10) = 1, then 1 = p(10) = p((10^{1/n})^n) = p(10^{1/n}) + p((10^{1/n})^{n−1}) = p(10^{1/n}) + p(10^{1/n}) + p((10^{1/n})^{n−2}) = . . . = n·p(10^{1/n}),
• 22. 10 hence p(10^{1/n}) = 1/n and, as a derivation of the above, p(10^{m/n}) = m/n for all positive integers m and n. Assuming the continuity of p, it is required that p(r) = log10 r. (1.2.5) Knuth suggested that, to be more rigorous, it is important to assume that there is some underlying distribution of numbers F(u); then the desired probability will be p(r) = Σ_m (F(10^m r) − F(10^m)), obtained as a result of adding over −∞ < m < ∞. Then the hypothesis of invariance and the continuity assumption lead to (1.2.5), which is the desired distribution. 1.2.4 The Brownian Model Scheme. Pietronero From another point of view, note that this can be viewed as a model for the oscillations of the stock market or any complex model in nature. A Brownian model is acceptable for this type of "Nature Process", L. Pietronero (2001). Brownian motion can be seen as a natural event that involves a change in the position or location of something. They propose N(t + 1) = ξN(t), where ξ is a positive definite stochastic variable (just for simplicity). With a logarithmic transformation a Gaussian process can be found: log(N(t + 1)) = log(ξ) + log(N(t)). If log(ξ) is considered as a stochastic variable, then log(N(t + 1)) is a Brownian motion. Then for t → ∞, P(log(N)) ∼ Unif(0, 1). Transforming the problem to the original form,
• 23. 11 P(log10 N) d(log10 N) = C (1/N) dN, where C is the normalization factor. It is obtained that P(N) ∼ N^{−1}. This suggests that the distribution of n is the First Digit Law distribution. On the other hand, equation (1.2.5) can be derived for any base b. His proposal states that for b > 1, Prob(n) ∝ ∫_n^{n+1} N^{−1} dN, which after normalizing over [1, b) gives Prob(n) = log10((n+1)/n) / log10 b. (1.2.6) Finally, using logarithm properties we get Prob(n) = log_b(1 + 1/n), (1.2.7) which is a generalization of the Newcomb-Benford's Law. 1.3 A Statistical Derivation of the N-B Law Theodore Hill has given a more general argument for dimensionless data. He has explained the central-limit-like theorem for significant digits by saying: Remark 1.3.1. "Roughly speaking, this law says that if probability distributions are selected at random and random samples are then taken from each of these distributions in any way so that the overall process is scale or (base) neutral, then the significant-digit frequencies of the combined sample will converge to the logarithmic distribution." Hill (1996) In order to understand such an explanation, we present here a brief introduction to measure theory. A fundamental concept in the development of the theory behind the family of outlaw numbers is the mantissa. This permits the isolation of the groups of significant digits.
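The generalized law (1.2.7) can be checked numerically for any base: the probabilities log_b(1 + 1/n) over n = 1, ..., b − 1 always sum to one, because the sum telescopes to log_b b. A small sketch (the function name is ours):

```python
import math

def benford_base(n, b):
    # Generalized first-digit law (1.2.7): Prob(n) = log_b(1 + 1/n)
    return math.log(1 + 1 / n, b)

# base 10 recovers the usual law; base 2 has a single digit, so Prob(1) = 1
octal = [benford_base(n, 8) for n in range(1, 8)]
```

For example, in base 8 the seven first-digit probabilities in `octal` sum to one, with digit 1 again the most likely.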
• 24. 12 Let D_b^(1), D_b^(2), D_b^(3), . . . denote the significant digit functions (base b). Example 1.3.1. As an example, note that D_10^(1)(25.4) = 2, D_10^(2)(25.4) = 5 and D_10^(3)(25.4) = 4. The exact laws, given by Newcomb (1881) in terms of the significant digits base 10, are: Prob(D_10^(1) = d) = log10(1 + 1/d); d = 1, 2, . . . , 9. (1.3.1) Prob(D_10^(2) = d) = Σ_{k=1}^{9} log10(1 + 1/(10k + d)); d = 0, 1, . . . , 9. (1.3.2) These equations show a way to write the NBL in terms of the significant digits. 1.3.1 Mantissa As mentioned, to formalize the way of writing numbers in terms of their digits, we introduce the mantissa. The aim of defining the mantissa is to put the NBL in a proper countably additive probability framework. Basically, the NBL is a statement in terms of the significant digit functions. Definition 1.3.1. The mantissa (base 10) of a positive real number x is the unique number r in (1/10, 1] with x = r · 10^n for some n ∈ Z. To become more familiar with the mantissa definition, look at scientific notation. Definition 1.3.2. A number is in scientific notation if it is in the form: Mantissa · 10^characteristic
• 25. 13 , where the mantissa (Latin for makeweight) must be any number from 1 up to (but not including) 10, and the characteristic is an integer indicating the number of places the decimal point moved. A more general definition of mantissa can be presented, a generalization for any base b > 1, as follows: Definition 1.3.3. For each integer b > 1, the (base b) mantissa function, Mb, is the function Mb : R+ → [1, b) such that Mb(x) = r, where r is the unique number in [1, b) with x = r · b^n for some n ∈ Z. For E ⊂ [1, b), let ⟨E⟩_b = Mb^{−1}(E) = ∪_{n∈Z} b^n E ⊂ R+. The (base b) mantissa σ-algebra is the σ-algebra on R+ generated by Mb. Example 1.3.2. Using the function Mb defined above, we can verify that 9 has the same mantissa function image for the bases 10, 100 and 1000. Note that M10(9) = 9, since 9 = r · 10^n = 9 · 10^0, with n = 0 and r = 9. The same holds for base b = 100: here n = 0 and r = 9 again. Moreover, note that for b = 2 we have M2(9) = 9/8 = 1.001 (base 2), since 9 = (9/8) · 2^3; this fits the definition since 9/8 ∈ [1, 2). Remark 1.3.2. Note that the mantissa function, Mb, assigns a unique value, hence it is well defined. An observation is that if E = [1, b) then ⟨E⟩_b = Mb^{−1}(E) = ∪_{n∈Z} b^n [1, b) = R+. And ⟨{1}⟩_10 = {10^n : n ∈ Z}
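Definition 1.3.3 translates directly into code: the base-b mantissa of x is b raised to the fractional part of log_b x, the same mod-1 mechanism used in (1.2.3). A sketch under our own naming:

```python
import math

def mantissa(x, b=10):
    # M_b(x): the unique r in [1, b) with x = r * b**n for an integer n
    # (floating-point rounding can perturb exact powers of b)
    frac = math.log(x, b) % 1.0      # fractional part of log_b(x)
    return b ** frac

# Example 1.3.2: M_10(9) = 9 and M_2(9) = 9/8 = 1.125
```

As a usage note, `mantissa(0.0031)` lands in [1, 10) as well, since Python's `%` operator returns a non-negative fractional part even for negative logarithms.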
• 26. 14 Lemma 1.3.3. For all b ∈ N − {1}: (i) ⟨E⟩_b = ∪_{k=0}^{n−1} ⟨b^k E⟩_{b^n}; (ii) Λ_b = {⟨E⟩_b : E ∈ B[1, b)}; (iii) Λ_b ⊂ Λ_{b^n} ⊂ B for all n ∈ N; (iv) Λ_b is closed under scalar multiplication. Proof. Part (i) of the lemma follows directly from the definition of ⟨·⟩_b; (ii) follows from the fact that if E is a Borel set in [1, b) then Λ_b will denote the set {∪_{k=0}^{n−1} ⟨b^k E⟩_{b^n} : E ∈ B[1, b)}. Taking points (i) and (ii) together we get point (iii). The last point of the lemma follows from point (ii), since Λ_b is the σ-algebra generated by {D_b^(1), D_b^(2), D_b^(3), . . .}. For the more general case of the s-digit law we have: Prob(mantissa ≤ t/10) = log10 t; t ∈ [1, 10). (1.3.3) Since we can write (1.3.3) using the digits, we have: Definition 1.3.4. General Significant Digit Law. For all positive integers k, all d1 ∈ {1, 2, . . . , 9} and all dj ∈ {0, 1, . . . , 9}, j = 2, . . . , k: Prob(D_10^(1) = d1, D_10^(2) = d2, . . . , D_10^(k) = dk) = log10[1 + (Σ_{i=1}^{k} d_i · 10^{k−i})^{−1}]. (1.3.4) Corollary 1.3.4. The significant digits are dependent.
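The General Significant Digit Law (1.3.4) and Corollary 1.3.4 can both be checked numerically: the joint probability of a digit block is log10(1 + 1/n), where n is the block read as an integer, and the product of the marginals does not factor it. A sketch (the function names are ours):

```python
import math

def joint_digit_prob(digits):
    # Eq. (1.3.4): P(D1 = d1, ..., Dk = dk) = log10(1 + 1/n), where n is
    # the digit block read as the integer d1*10^(k-1) + ... + dk
    n = int("".join(str(d) for d in digits))
    return math.log10(1 + 1 / n)

p_joint = joint_digit_prob([3, 1])                              # P(D1=3, D2=1)
p_first = joint_digit_prob([3])                                 # P(D1=3)
p_second = sum(joint_digit_prob([k, 1]) for k in range(1, 10))  # P(D2=1)
# p_joint differs from p_first * p_second, so the digits are dependent
```

Summing (1.3.4) over the second digit recovers the first-digit marginal, a quick consistency check on the formula.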
• 27. 15 This corollary can be proved by giving a counterexample; that is the way Hill worked it out. Now it is important to state a natural probability space in which we can describe properly every detail of each Newcomb Benford's Law scheme. At this point strong measure-theoretic tools are needed, such as the σ-fields generated by the set of the r significant digits. 1.3.2 A Natural Probability Space Let the sample space R+ be the set of positive real numbers, and let the sigma algebra of events simply be the σ-field generated by {D_10^(1), D_10^(2), D_10^(3), . . .} or, equivalently, generated by the mantissa function x → mantissa(x). This σ-algebra is denoted by Λ and will be called the decimal mantissa σ-algebra. It is a sub-σ-field of the Borel sets, and S ∈ Λ ⇔ S = ∪_{n=−∞}^{∞} B · 10^n (1.3.5) for some Borel B ⊆ [1, 10), which is the generalization of D1 = ∪_{n=−∞}^{∞} [1, 2) · 10^n, the set of positive numbers whose first digit is 1. 1.3.3 Mantissa σ-algebra Properties The mantissa σ-algebra has several properties: 1. Every non-empty set in Λ is infinite, with accumulation points at 0 and at +∞. 2. Λ is closed under scalar multiplication. 3. Λ is closed under integral roots, but not powers.
• 28. 16 4. Λ is self-similar in the sense that if S ∈ Λ, then 10^m · S = S for every integer m. Here aS and S^a represent, respectively, {as : s ∈ S} and {s^a : s ∈ S}. The first property implies that finite intervals are not included in Λ: they are not expressible in terms of the significant digits. Note that significant digits alone cannot distinguish between the numbers 10 and 100, and thus the countably additive contradiction associated with scale invariance disappears. Properties 1, 2 and 4 follow directly from 1.3.5, but the closure under integral roots needs more detail. The square root of a set in Λ may need two parts, and similarly for higher roots. Example 1.3.5. If S = {D1 = 1} = ∪_{n=−∞}^{∞} [1, 2) · 10^n, then S^{1/2} = ∪_{n=−∞}^{∞} [1, √2) · 10^n ∪ ∪_{n=−∞}^{∞} [√10, √20) · 10^n ∈ Λ, but S^2 = ∪_{n=−∞}^{∞} [1, 4) · 10^{2n} ∉ Λ, since it has gaps (which are too large) and thus cannot be written in terms of the digits. Just as property 2 is the key to the hypothesis of scale invariance, property 4 is for base invariance as well.
• 29. 17 1.3.4 Scale and Base Invariance The mantissa σ-algebra Λ represents a proper measurability structure. In order to be rigorous, it is time to state a proper definition of a scale invariant measure. Definition 1.3.5. A probability measure P on (R+, Λ) is scale invariant if P(S) = P(sS) for all s > 0 and all S ∈ Λ. The NB Law (1.3.3, 1.3.4) is characterized by the scale invariance property. Theorem 1.3.6. Hill (1995a) A probability measure P on (R+, Λ) is scale invariant if and only if P(∪_{n=−∞}^{∞} [1, t) · 10^n) = log10 t (1.3.6) for all t ∈ [1, 10). Definition 1.3.6. A probability measure P on (R+, Λ) is base invariant if P(S) = P(S^{1/n}) for all positive integers n and all S ∈ Λ. Observe that the set of numbers S_t = {D1 = t, Dj = 0 for all j > 1, with t ∈ [1, 10)} = {. . . , 0.0t, 0.t, t, t0, t00, . . .} = ∪_{n=−∞}^{∞} {t} · 10^n has, by 1.3.5, no nonempty proper Λ-measurable subsets. Recall the definition of a Dirac measure: Definition 1.3.7. The Dirac measure δ_t associated to a point t ∈ R+ is defined as follows: δ_t(S) = 1 if t ∈ S, and δ_t(S) = 0 if t ∉ S.
• 30. 18 Using the above definition and letting PL denote the logarithmic probability distribution on (R+, Λ) given in 1.3.3, a complete characterization of base-invariant significant-digit probability measures can now be given. Theorem 1.3.7. Hill (1995a) A probability measure P on (R+, Λ) is base invariant if and only if P = q·PL + (1 − q)·δ1 for some q ∈ [0, 1]. Note that P is a convex combination of the two measures PL and δ1. Using theorems 1.3.6 and 1.3.7, T. Hill states that scale invariance implies base invariance, but not conversely. This is because δ1 is base invariant but not scale invariant. The proofs of those theorems are not reproduced here, but they are important in the sense of summarizing the statistical derivation presented by T. Hill. Recall that a (real Borel) random probability measure (r.p.m.) M is a random vector (on an underlying probability space (Ω, F, P)) taking values which are Borel probability measures on R, and which is regular in the sense that for each Borel set B ⊂ R, M(B) is a random variable. Definition 1.3.8. The expected distribution measure of the r.p.m. M is the probability measure EM (on the Borel subsets of R) defined by (EM)(B) = E(M(B)) for all Borel B ⊂ R (1.3.7) (where here and throughout, E(·) denotes expectation with respect to P on the underlying probability space). The next definition plays a central role in this section, and formalizes the concept of the following natural process which mimics Benford's data-collection procedure:
• 31. 19 pick a distribution at random and take a sample of size k from this distribution; then pick a second distribution at random, and take a sample of size k from this second distribution, and so forth. Definition 1.3.9. For an r.p.m. M and positive integer k, a sequence of M-random k-samples is a sequence of random variables X1, X2, . . . on (Ω, F, P) such that, for some i.i.d. sequence M1, M2, . . . of r.p.m.'s with the same distribution as M, and for each j = 1, 2, . . ., given Mj = P, the random variables X(j−1)k+1, . . . , Xjk are i.i.d. with d.f. P; and X(j−1)k+1, . . . , Xjk are independent of {Mi; X(i−1)k+1, . . . , Xik} for all i ≠ j. The following lemma shows the somewhat curious structure of such sequences. Lemma 1.3.8. Hill (1995a) Let X1, X2, . . . be a sequence of M-random k-samples for some k and some r.p.m. M. Then (i) the Xn are a.s. identically distributed with distribution EM, but are not in general independent, and (ii) given M1, M2, . . ., the Xn are a.s. independent, but are not in general identically distributed. As Hill states in his paper: Remark 1.3.3. In general, sequences of M-random k-samples are not independent, not exchangeable, not Markov, not martingale, and not stationary sequences. Example 1.3.9. Let M be a random measure which is the Dirac probability measure δ(1) at 1 with probability 1/2, and which is (δ(1) + δ(2))/2 otherwise, and let k = 3. Then M1 will be assigned to δ(1) with probability 1/2, and to the mixture otherwise.
• 32. 20 (i) Since P(X2 = 2) = P(X2 = 2|M1 = δ(1))P(M1 = δ(1)) + P(X2 = 2|M1 = mixture)P(M1 = mixture) = 0·(1/2) + (1/2)(1/2) = 1/4, but P(X2 = 2|X1 = 2) = 1/2, the variables X1, X2 are not independent. (ii) Since P((X1, X2, X3, X4) = (1, 1, 1, 2)) = 9/64 > 3/64 = P((X1, X2, X3, X4) = (2, 1, 1, 1)), the (Xn) are not exchangeable. (iii) Since P(X3 = 1|X1 = X2 = 1) = 9/10 > 5/6 = P(X3 = 1|X2 = 1), the (Xn) are not Markov. (iv) Since E(X2|X1 = 2) = 3/2 ≠ 2, the (Xn) are not a martingale. (v) And since P((X1, X2, X3) = (1, 1, 1)) = 9/16 > 15/32 = P((X2, X3, X4) = (1, 1, 1)), the (Xn) are not stationary. The next lemma is simply the statement of the intuitive fact that the empirical distribution of M-random k-samples converges to the expected distribution of M. Lemma 1.3.10. Hill (1995a) Let M be an r.p.m., and let X1, X2, . . . be a sequence of M-random k-samples for some k. Then lim_{n→∞} ♯{i ≤ n : Xi ∈ B}/n = E[M(B)] (1.3.8) a.s. for all Borel B ⊂ R.
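The probabilities in Example 1.3.9 are easy to confirm by simulation. The sketch below (the setup follows the example; the code itself is ours) draws the first block X1, X2, X3 of M-random 3-samples many times and checks P(X2 = 2) = 1/4 against P(X2 = 2 | X1 = 2) = 1/2:

```python
import random

def first_block(rng):
    # One M-random 3-sample block: the block's measure is delta(1) with
    # probability 1/2, and the mixture (delta(1) + delta(2))/2 otherwise
    mixture = rng.random() < 0.5
    return [rng.choice([1, 2]) if mixture else 1 for _ in range(3)]

rng = random.Random(7)
blocks = [first_block(rng) for _ in range(200_000)]
p_x2 = sum(b[1] == 2 for b in blocks) / len(blocks)                  # near 1/4
given_x1 = [b for b in blocks if b[0] == 2]
p_x2_given_x1 = sum(b[1] == 2 for b in given_x1) / len(given_x1)     # near 1/2
```

The gap between the two estimates is exactly the failure of independence asserted in point (i).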
• 33. 21 Note that if we choose k = 1, fix B and j ∈ N, and let Yj = ♯{Xj ∈ B} (the indicator of the event Xj ∈ B), then lim_{n→∞} ♯{i ≤ n : Xi ∈ B}/n = lim_{n→∞} (Σ_{j=1}^{n} Yj)/n. (1.3.9) By 1.3.8, given Mj, Yj is as in the Bernoulli case with parameter E[M_j(B)], so by 1.3.7, EYj = E(E(Yj|M_j)) = E[M(B)] (1.3.10) a.s. for all j, since M_j has the same distribution as M. By 1.3.8 the {Yj} are independent. Since by 1.3.10 they have identical means E[M(B)] and are uniformly bounded, it follows, Loève (1977), that lim_{n→∞} (Σ_{j=1}^{n} Yj)/n = E[M(B)] (1.3.11) a.s. This is basically just the Bernoulli case of the strong law of large numbers. Remark 1.3.4. Roughly speaking, this law says that if probability distributions are selected at random, and random samples are then taken from each of these distributions in any way so that the overall process is scale (or base) neutral, then the significant digit frequencies of the combined sample will converge to the logarithmic distribution. Thus far a proper definition of a random sequence in terms of the mantissa has been expressed. Definition 1.3.10. A sequence of random variables X1, X2, . . . has scale-neutral mantissa frequency if |♯{i ≤ n : Xi ∈ S} − ♯{i ≤ n : Xi ∈ sS}|/n → 0 a.s.
• 34. 22 for all s > 0 and all S ∈ M, and has base-neutral mantissa frequency if |♯{i ≤ n : Xi ∈ S} − ♯{i ≤ n : Xi ∈ S^{1/m}}|/n → 0 a.s. for all m ∈ N and S ∈ M. Definition 1.3.11. An r.p.m. M is scale-unbiased if its expected distribution EM is scale invariant on (R+, M), and is base-unbiased if E[M(B)] is base invariant on (R+, M). (Recall that M is a sub-σ-algebra of the Borel sets, so every Borel probability on R (such as EM) induces a unique probability on (R+, M).) For the main new statistical result, here M(t) denotes the random variable M(Dt), where Dt = ∪_{n=−∞}^{∞} [1, t) × 10^n is the set of positive numbers with mantissa in [1/10, t/10). M(t) may be viewed as the random cumulative distribution function for the mantissa of the r.p.m. M. Theorem 1.3.11. (Log-limit law for significant digits). Let M be an r.p.m. on (R+, M). The following are equivalent: (i) M is scale-unbiased; (ii) M is base-unbiased and E[M(B)] is atomless; (iii) E[M(t)] = log10 t for all t ∈ [1, 10); (iv) every M-random k-sample has scale-neutral mantissa frequency; (v) EM is atomless, and every M-random k-sample has base-neutral mantissa frequency;
• 35. 23 (vi) for every M-random k-sample X1, X2, . . ., ♯{i ≤ n | mantissa(Xi) ∈ [1/10, t/10)}/n → log10 t a.s. for all t ∈ [1, 10). The statistical log-limit significant-digit law helps justify some of the recent applications of Newcomb Benford's Law, several of which will now be described. Remember that most of the results in this section are transcribed and commented from Hill (1996). The proof of each of the results is included by referencing each lemma and theorem. 1.4 Mean and Variance of the D_b^k The numerical values of the Significant Digit Law can be computed using these expressions: E(D_b^(k)) = Σ_n n·Prob(D_b^(k) = n) (1.4.1) Var(D_b^(k)) = Σ_n n²·Prob(D_b^(k) = n) − E(D_b^(k))² (1.4.2) where n runs over the possible values of the k-th digit. As an example, suppose as usual that b = 10; then we already know the theoretical values for the distributions of the first and second significant digits, and some statistics for these distributions are given below. The standard deviation is the well-known distance of the points from the mean, and the variation coefficient is the ratio between the standard deviation and the mean of the distribution. These are the central tendency and dispersion measures most used by researchers.
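Formulas 1.4.1 and 1.4.2 can be evaluated directly; the values obtained this way match Table 1.1. A sketch with our own function names (only k = 1 and k = 2 are implemented here):

```python
import math

def digit_pmf(k):
    # Marginal law of the k-th significant digit (base 10): for k = 1 the
    # support is 1..9 (eq. 1.3.1); for k = 2 it is 0..9, summing (1.3.2)
    if k == 1:
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    return {d: sum(math.log10(1 + 1 / (10 * j + d)) for j in range(1, 10))
            for d in range(10)}

def mean_var(pmf):
    # Eqs. (1.4.1) and (1.4.2)
    mean = sum(d * p for d, p in pmf.items())
    var = sum(d * d * p for d, p in pmf.items()) - mean ** 2
    return mean, var

m1, v1 = mean_var(digit_pmf(1))   # approx. 3.44024, 6.05651
m2, v2 = mean_var(digit_pmf(2))   # approx. 4.18739, 8.25378
```

The standard deviations and variation coefficients of Table 1.1 then follow as `math.sqrt(v)` and `math.sqrt(v) / m`.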
• 36. 24 Table 1.1: Mean, Variance, Standard Deviation and Variation Coefficient for the First and Second Significant Digit Distributions. First: Mean 3.44024, Variance 6.05651, STDEV 2.46099, Variation 0.71536. Second: Mean 4.18739, Variance 8.25378, STDEV 2.87294, Variation 0.68609. 1.5 Simulation Let X, as usual, be a random variable having Benford's distribution. Using 1.3.6, X can be generated via X ← ⌊10^U⌋ (1.5.1) where U ∼ Unif(0, 1). Note that the operator ⌊·⌋ represents the integer part of the number between the symbols. Actually, the above expression is for the first significant digit. The interesting case is how to generate random values from each of the marginals of the Generalized Newcomb Benford's distribution for all digits, not only the first. Moreover, suppose there is some bound on the maximum number N (like in elections). What would a "Newcomb Benford's Law under a restriction" be? How do bounds affect the sample generated? 1.5.1 Generating the r Significant Digits' Distribution, Base b. For this, remember that the Significant Digit Law can be stated as: Fx(x) = log10(1 + 1/x) (1.5.2)
• 37. 25 for x = 10^{r−1}, 10^{r−1}+1, . . . , 10^r −1. Then, going directly to the definition of probability, the expression above can be written as: Fx(x) = Pr(X ≤ x) = Σ_{i=10^{r−1}}^{x} log10(1 + 1/i) = Σ_{i=10^{r−1}}^{x} (log10(i + 1) − log10 i) (1.5.3) = log10(x + 1) − log10 10^{r−1} (by the telescoping of the sum) = log10(x + 1) − r + 1. Hence the cumulative distribution function can be stated as: Fx(x) = log10(x + 1) − r + 1. (1.5.4) Note that the same derivation can be done using an arbitrary base b. In order to generate values from this distribution, suppose that u ∼ Unif(0, 1), and also suppose, as usual, that u = Fx(x). Substituting in 1.5.4 and solving for x we get x = 10^{u+r−1} − 1; using the floor function to get a closed-form expression, X ∼ ⌊10^{u+r−1}⌋. (1.5.5) Moreover, this can be generalized to a base b > 1, for which X ∼ ⌊b^{u+r−1}⌋ (1.5.6) where U ∼ Unif(0, 1). 1.5.2 Effects of Bounds on the Newcomb-Benford Generated Values. There is an open question: if there is an upper bound on the data values, what effect does this have, if any, on Newcomb Benford's Law? For this, suppose as above that
• 38. 26 X ∼ NBenford(r, b), that is, X is a random variable distributed as a Newcomb-Benford's Law for digit r and base b > 1. Using equation 1.5.5 we can generate the marginal distribution by applying a modular function base b. That is, X ∼ ⌊b^{U+r−1}⌋ mod b (1.5.7) with U ∼ U(0, 1). Note that for b = 10 and r = 2, expression 1.5.5 will generate numbers from the set {10, 11, 12, . . . , 99}. Let K be an upper bound for the experimental observations: X ← ⌊10^{U+2−1}⌋ (1.5.8) Then Z = ⌊10^{U+1}⌋ I_(0,K](z) (1.5.9) where I_S is the indicator function defined as I_S(x) = 1 if x ∈ S, and 0 otherwise. When we use r = 2 we are generating from the second digit law. There are some complications at the moment of generating numbers from 1 to 99, since in this case there are two different types of numbers: from 1 to 9, where the number of digits is one, and from 10 to 99, where the number of digits is two. Since equation 1.5.7 depends on the number of digits to simulate, there is the need to simulate proportionally from the set of numbers from 1 to 9 and the set of numbers from 10 to 99. The proportion of the first set of numbers is 1/9, and 8/9 for the second set. The trick here is to generate 1/9 of the sample size using random numbers from an N-B Distribution with r = 1, and the other 8/9 of the desired sample from an
N-B distribution with r = 2. This can be generalized for larger r's. The main topic of this section is how the N-B Law behaves under bounds. For this some notation is needed:

(i) p_i^B is the Newcomb-Benford probability for the number i;
(ii) p_i^C is the proportion of the numbers in the set that will be sampled under the constraint N ≤ K;
(iii) p_i^U is the proportion of the numbers in the set under no constraints.

As an observation, if there is no bound then p^C = p^U.

Example 1.5.1. Suppose that K = 52; then p_1^C = 11/52 and p_1^U = 11/99 = 1/9.

Definition 1.5.1. The "Constrained N-B Law Distribution" is defined as:

P(D_i = x | N ≤ K) = (p_i^C / p_i^U) p^B(D_i) / Σ_k (p_k^C / p_k^U) p^B(D_k)   (1.5.10)

Suppose that we want a bound at N = 65; then let us compare how close the theoretical function 1.5.10 is to the simulation using the bound. The following figures present different simulations with different bounds or constraints. The conclusion is that the agreement between the theoretical law under constraints 1.5.10 and the simulation is excellent. In fact, equation 1.5.10 may be considered the "Constrained Newcomb-Benford Law". To our knowledge this is the first time it has been introduced. Note that 1.5.10 can be adapted for a lower bound as well.
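Definition 1.5.1 can be checked by Monte Carlo. The sketch below (an illustration under our reading of the definition, not the thesis's code) computes the constrained first-digit law on the support 1..99 and compares it with a simulation that generates Benford numbers and discards those above the bound K; for K = 65 the two agree closely, as in the figures:

```python
import math
import random

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def constrained_benford(K, N=99):
    """Definition 1.5.1 for the first digit, support 1..N, upper bound K."""
    pB = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    pC = {d: sum(1 for n in range(1, K + 1) if first_digit(n) == d) / K
          for d in range(1, 10)}
    pU = {d: sum(1 for n in range(1, N + 1) if first_digit(n) == d) / N
          for d in range(1, 10)}
    w = {d: pC[d] / pU[d] * pB[d] for d in range(1, 10)}   # unnormalized weights
    z = sum(w.values())
    return {d: w[d] / z for d in range(1, 10)}

def simulate(K, size=40000, seed=3):
    """Draw from the 1/9 : 8/9 mixture on 1..99 and keep only values <= K."""
    rng = random.Random(seed)
    counts, kept = {d: 0 for d in range(1, 10)}, 0
    while kept < size:
        r = 1 if rng.random() < 1 / 9 else 2
        x = math.floor(10 ** (rng.random() + r - 1))
        if x <= K:
            counts[first_digit(x)] += 1
            kept += 1
    return {d: counts[d] / size for d in range(1, 10)}

theo = constrained_benford(65)
emp = simulate(65)
```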
[Figure 1.1: Constrained Newcomb-Benford Law compared with a restricted-bound simulation with K ≤ 99, for numbers between 1 and 99. Here there is no restriction. Both panels plot the probability distribution of the first digits 1-9 for the simulated bound, the theoretical bounded law, and the N-B law.]

[Figure 1.2: Constrained Newcomb-Benford Law compared with a restricted-bound simulation with K ≤ 50, for numbers between 1 and 99.]
[Figure 1.3: Constrained Newcomb-Benford Law compared with a restricted-bound simulation with K ≤ 20, for numbers between 1 and 99.]
Chapter 2

Empirical Analysis

2.1 Introduction

The analysis of uncertainty is as old as civilization itself. There are several interpretations of the phenomena that rule Nature in the most general case. In modern times the basis of this theory comes from the work of Bernoulli, Laplace, and Thomas Bayes. Characterizing knowledge about chance and uncertainty with the measuring tools provided by logic is the fundamental baseline for most of the results here. There are distinctions between Classical and Bayesian statistics; we discuss at least null hypothesis probabilities and the p-value. We then carry out a concise analysis of Newcomb-Benford sequences in a Bayesian scheme, using state-of-the-art tools to calculate how close the data are to the Law. Finally we present some examples in order to see how the Newcomb-Benford Law behaves as the mixture of probability random variables becomes more complicated.
2.2 Changing P-Values into Null Hypothesis Probabilities

The p-value is the probability of getting values of the test statistic as extreme as, or more extreme than, that observed if the null hypothesis is true. For a single sample the χ² statistic is given as

χ² = Σ_{D=1}^{9} (Prob(D_10^(1) = D) − f_D)² / Prob(D_10^(1) = D)   (2.2.1)

where f_D is the observed proportion of the digit D among the first digits of the data entries. This is the basis of a classical test of the null hypothesis that the data follow the Newcomb-Benford Law. If the null hypothesis is accepted, the data "passed" the test; if not, it opens the possibility that the data were manipulated. As we have presented, most of the data we intend to analyze behave as a random mix of models. We have to specify our null hypothesis, H0, and its alternative, H1. In the electoral-process setting, H0 means that there has been no intervention in the data, while H1 means that there has been intervention in the data-gathering process. It is important to measure the null hypothesis against its evidence: in our case, if the data obey Benford's Law, this suggests that there has been no intervention in the electoral votes. There is a common misunderstanding between the probability of the null hypothesis and the p-value. For a null hypothesis H0 we have (Berger O. J., 2001):

Pval = Prob(result equal to or more extreme than the data | null hypothesis)
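A hedged sketch of the classical test: the statistic below is the Pearson-scaled version of (2.2.1), i.e. multiplied by the sample size n so that it can be referred to a χ² distribution with 8 degrees of freedom; the function name and the sample construction are ours. Digits built to match Benford's proportions pass easily, while uniform digits fail badly:

```python
import math

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi2(digits):
    """Statistic (2.2.1) scaled by the sample size n (the usual Pearson form)."""
    n = len(digits)
    chi2 = 0.0
    for d, p in BENFORD.items():
        f = sum(1 for x in digits if x == d) / n   # observed proportion f_D
        chi2 += n * (f - p) ** 2 / p
    return chi2

# A sample whose digit counts match Benford's proportions almost exactly ...
benford_sample = [d for d in range(1, 10) for _ in range(round(100000 * BENFORD[d]))]
# ... versus a uniform-digit sample of the same kind.
uniform_sample = [d for d in range(1, 10) for _ in range(1000)]
```

The critical value for rejecting at the 5% level with 8 degrees of freedom is 15.507.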
Table 2.1: p-values in terms of hypothesis probabilities.

Pval     P(H0 | data)
0.05     0.29
0.01     0.11
0.001    0.0184

If the p-value is small (e.g. Pval < 0.05 or less), the observation is declared significant. But p-values are not null hypothesis probabilities. If P(H0) = P(H1), H0 has produced an unusual observation, and Pval < e^(−1), then:

P(H0 | data) / P(H1 | data) ≥ −e · Pval · log_e(Pval)

⇒ P(H0 | Pval) ≥ 1 / (1 + [−e · Pval · log_e(Pval)]^(−1))

A full discussion of this matter can be found in (Berger O. J., 2001). It is more natural to calculate the p-value with respect to the goodness-of-fit test of the proportions of the observed digits versus the proportions specified by Newcomb-Benford's Law. As we can see in Table 2.1, the correction is quite important for improving the calculations. The table shows how much larger this lower bound is than the p-value: a small p-value (i.e. Pval = 0.05) implies that the posterior probability of the null hypothesis is at least 0.29, which is not very strong evidence against the hypothesis. As an alternative procedure we can use the BIC (Bayesian Information Criterion), or Schwarz's criterion (Berger J.O. and Pericchi L. R., 2001), which takes the sample size into account in explicit form:
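The bound quoted above is easy to evaluate; the helper below (illustrative, not from the thesis) reproduces Table 2.1 to two decimal places:

```python
import math

def posterior_lower_bound(pval):
    """Lower bound on P(H0 | data) from the p-value calibration above.

    Valid for pval < 1/e, with equal prior probabilities on H0 and H1.
    """
    odds = -math.e * pval * math.log(pval)   # lower bound on the posterior odds
    return 1 / (1 + 1 / odds)

table = {p: posterior_lower_bound(p) for p in (0.05, 0.01, 0.001)}
```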
log[ P(H0 | data) / P(H1 | data) ] ≈ log(Likelihood Ratio) + ((k1 − k0)/2) log(N)   (2.2.2)

The likelihood ratio can be calculated from a multinomial density: in the numerator we have the proportions assigned under H0 and in the denominator the data digit proportions. The evidence against the null hypothesis can be measured using the BIC. In this case the null hypothesis represents that the data follow an N-Benford's Law distribution.

2.2.1 Posterior Probabilities with Uniform Priors

Let Υ1 be the set of integers in [1, 9] and Υ2 the set of integers in [0, 9]. When the first digit is observed, the possible values are members of Υ1; if we observe the second digit, or any position other than the first, the values are members of Υ2. When the same index can apply to the first or to any other digit, a member of Υ1 or Υ2 will be referred to simply as a member of Υ. Let

Ω = {p1 = p01, p2 = p02, ..., pk = p0k | Σ_{i=1}^{k} p0i = 1}

where k = 9 in the case of the first digit, and k is extended to 10 for any other digit. Note that, using the sets defined above, we can rewrite Ω in terms of Υ as follows:

Ω = {pi = p0i ∀ i ∈ Υ | Σ_{i∈Υ} p0i = 1}.

Then our hypotheses can be written as:

H0: p ∈ Ω
H1: p ∈ Ωc   (2.2.3)

where Ωc denotes the complement of Ω; in other words, Ωc = {pi ≠ p0i for some i ∈ Υ}.
Assume a uniform prior for the values of the pi's; then

Π_u(p1, p2, ..., pk) = Γ(k) = (k − 1)!   (2.2.4)

We can write the posterior probability of H0 in terms of the Bayes factor. Let x be the data vector; by the definition of the Bayes factor we have:

B01 = [P(H0 | x) P(H1)] / [P(H1 | x) P(H0)]   (2.2.5)

If we have nested models and P(H0) = P(H1) = 1/2, then the Bayes factor reduces to

B01 = P(H0 | x) / P(H1 | x) = P(H0 | x) / (1 − P(H0 | x))   (2.2.6)

Solving for the posterior probability,

1/P(H0 | x) − 1 = 1/B01  ⇔  P(H0 | x) = B01 / (B01 + 1)

therefore

P(H0 | x) = B01 / (B01 + 1)   (2.2.7)

For the i-th significant digit of each element of the data, let n = (n1, n2, ..., nk), where ni is the number of times the digit i appears in the data. Recall that if we observe the first digit then i ∈ Υ1, while for the second digit and onwards i ∈ Υ2; more generally, by the convention above, i ∈ Υ in either case. Using the definition applied to problem 2.2.3, we have

B01 = f(n1, n2, n3, ..., nk | Ω) / ∫_{Ωc} f(n1, n2, n3, ..., nk | Ωc) Π_U(p1, p2, p3, ..., pk) dp1 dp2 dp3 ... dp_{k−1}
with Σ_{i∈Υ} pi = 1 and pi ≥ 0 ∀ i ∈ Υ. Substituting in our problem,

B01 = [ (n! / Π_i ni!) Π_{i=1}^{k} p_i0^{ni} ] / [ (k − 1)! ∫ (n! / Π_i ni!) Π_{i=1}^{k} p_i^{ni+1−1} dpi ]

where the integral is taken over the simplex Σ_i pi = 1. Cancelling the factorial terms and using the Dirichlet integral identity

∫ Π_{i=1}^{k} p_i^{ni+1−1} dpi = Π_{i=1}^{k} Γ(ni + 1) / Γ(n + k),

we reach a simplified expression for B01:

B01 = p_10^{n1} p_20^{n2} ··· p_k0^{nk} / [ (k − 1)! Π_{i=1}^{k} Γ(ni + 1) / Γ(n + k) ]   (2.2.8)

Since we already know how to get the posterior probability from the Bayes factor (using 2.2.7), substituting B01 we have:

P(H0 | x) = { p_10^{n1} p_20^{n2} ··· p_k0^{nk} / [ (k − 1)! Π_{i=1}^{k} Γ(ni + 1) / Γ(n + k) ] } / { p_10^{n1} p_20^{n2} ··· p_k0^{nk} / [ (k − 1)! Π_{i=1}^{k} Γ(ni + 1) / Γ(n + k) ] + 1 }   (2.2.9)

There are different ways to calculate the probability of the null hypothesis given the data; each depends on the prior knowledge and on the type of Bayes factor or approximation in use. For instance, P(Frac) is based on the Fractional Bayes Factor (Berger J.O. and Pericchi L. R., 2001):

B01^Frac = [ f0(data | p0) / ∫_Ω f1(data | p) π^N(p) dp ] · r,  with  r = ∫_Ω f1^{r/n}(data | p) π^N(p) dp / f1^{r/n}(data | p0)   (2.2.10)

where p0 is given by the Newcomb-Benford Law and r is the number of adjustable parameters minus one, that is, r = 8 or r = 9 for the first and second digit respectively. P(Approx) is based on the following approximation to the Bayes factor:

B01^Approx = ( f0(data | p0) / f1(data | p̂) )^{1 − r/n} (n/r)^{r/2}   (2.2.11)
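Equations (2.2.8)-(2.2.9) can be evaluated stably in log scale with log-gamma functions. The sketch below (illustrative; the function and variable names are ours) computes P(H0 | x) for first-digit counts, showing that counts matching Benford's proportions favour H0 while uniform counts reject it:

```python
import math

def log_B01(counts, p0):
    """log of the Bayes factor (2.2.8): exact multinomial H0 probabilities p0
    against a uniform prior on the alternative simplex (k categories)."""
    k, n = len(p0), sum(counts)
    log_num = sum(ni * math.log(pi) for ni, pi in zip(counts, p0))
    log_den = (math.lgamma(k)                               # log (k-1)!
               + sum(math.lgamma(ni + 1) for ni in counts)  # log prod Gamma(ni+1)
               - math.lgamma(n + k))                        # log Gamma(n+k)
    return log_num - log_den

def posterior_H0(counts, p0):
    """Equation (2.2.9): P(H0 | x) = B01 / (B01 + 1), evaluated in log scale."""
    return 1 / (1 + math.exp(-log_B01(counts, p0)))

p0 = [math.log10(1 + 1 / d) for d in range(1, 10)]
benford_counts = [round(1000 * p) for p in p0]   # digit counts proportional to the law
uniform_counts = [112] * 9                       # digit counts far from the law
```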
where p̂ is the maximum likelihood estimator of p. The GBIC is based on a still-unpublished proposal by (Berger J.O., 1991), which in turn is based on the prior in (Berger J.O., 1985).

2.3 Multinomial Model Proposal

In what follows, let the observed digit be i, with i ∈ Υ as usual. Following (Ley, 1996), the data can be thought of as a random variable N distributed as a multinomial with parameter vector θ; thus

f(N | θ) = [ (Σ_{j∈Υ} nj)! / Π_{j∈Υ} nj! ] Π_{j∈Υ} θj^{nj}   (2.3.1)

As usual we will assume a uniform prior for θ, with mean 1/k for each of the θj, where k is the cardinality of the set Υ; that is, if we are working with the first digit then k = |Υ1| = 9, and if the observed significant digit is the second or beyond then k = |Υ2| = 10. The natural conjugate prior is a Dirichlet density, which has the following general form:

Di_k(p | α) = c (1 − Σ_{l=1}^{k} pl)^{α_{k+1} − 1} Π_{l=1}^{k} pl^{αl − 1}   (2.3.2)

where

c = Γ(Σ_{l=1}^{k+1} αl) / Π_{l=1}^{k+1} Γ(αl)

and α = (α1, α2, ..., α_{k+1}) such that every αl > 0, and p = (p1, p2, ..., pk) with 0 < pi < 1, Σ_{l=1}^{k} pl < 1, and p_{k+1} = 1 − Σ_{l=1}^{k} pl. For simplicity we will take each αi = α; thus

g(p) = [ Γ(kα) / Γ(α)^k ] Π_{j∈Υ} pj^{α − 1}   (2.3.3)
The posterior distribution of p is given by a Dirichlet with parameters {α + n1, α + n2, ..., α + nk}. Then we have that

h(θ | x) = [ Γ(kα + Σ_{j∈Υ} nj) / Π_{j∈Υ} Γ(α + nj) ] Π_{j∈Υ} θj^{α + nj − 1}   (2.3.4)

2.4 Examples

Our aim now is to show empirically how effective the reasoning 1.3.1 given by (Hill, 1996) can be. Our first examples use distribution functions from an exponential family. Most applications that involve a multilevel analysis use so-called hierarchical models. This type of model allows a more "objective" approach to inference by estimating the parameters of prior distributions from data, rather than requiring them to be specified using subjective information (Gelman A., 1995; Carlin Bradley P., 2000).

Example 2.4.1. The simplest model presented here is a Poisson model with a fixed parameter λ. For this first case, 500 values are simulated with λ = 100. The resulting P(H0 | data) = 0, which indicates how poor this model is at simulating a Benford process. As Hill stated, and as we discussed in earlier chapters, the N-B Law can be satisfied when there is a random mixture of distributions. Figure 2.2 shows how poorly the first-digit frequencies of the simulated values compare with the N-B Law for the first digit. Remember that this is the simplest model; it has no hierarchical structure.

Example 2.4.2. The following is a simple two-stage hierarchical model in which some of the parameters are fixed. It is a model frequently used in actuarial sciences and quality
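Example 2.4.1 is easy to reproduce. The sketch below (illustrative; it uses Knuth's Poisson sampler rather than whatever generator the thesis used) shows the first-digit failure of a fixed-λ Poisson model: because the values concentrate near 100, the first digit is 1 about half the time, far from the Benford proportion log10(2) ≈ 0.301, and middle digits hardly appear at all:

```python
import math
import random

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def poisson(lam, rng):
    """Knuth's multiplicative method (adequate for moderate lambda)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(11)
values = [poisson(100, rng) for _ in range(2000)]
freq = {d: sum(1 for v in values if first_digit(v) == d) / len(values)
        for d in range(1, 10)}
```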
[Figure 2.1: Posterior intervals for the first and second digits, shown as symmetric boxplots of the marginal posterior of the Newcomb-Benford proportions. (a) First-digit boxplot; (b) second-digit boxplot.]
control:

n ∼ Pois(λν)
λ ∼ G(α, β)   (2.4.1)

The marginal probability distribution is given by

Pg(n | α, β, ν) = ∫_0^∞ Pois(n | λν) G(λ | α, β) dλ

The resulting expression is known as a generalization of the negative binomial distribution, Nb(n | α, β/(β+ν)):

Pg(n | α, β, ν) = ∫_0^∞ [ e^{−λν} (λν)^n / Γ(n+1) ] [ β^α λ^{α−1} e^{−βλ} / Γ(α) ] dλ
               = [ β^α ν^n / (Γ(α) Γ(n+1)) ] ∫_0^∞ λ^{n+α−1} e^{−(β+ν)λ} dλ
               = [ Γ(n+α) / (Γ(n+1) Γ(α)) ] (β/(β+ν))^α (ν/(β+ν))^n

First we treat the Gamma part of the model above as a mixture over different values of the parameter λ in the Poisson distribution function. The set of λ values will be {10, 20, 30, 50, 70}. The overall simulated data vector is partitioned into blocks of length 50, each corresponding to the Poisson model with one of these λ values. Carrying out the Benford analysis we get P(H0 | data) = 0.878719187. Here we can see that for this small example of mixtures the Newcomb-Benford Law works. Note that Figure 2.3 shows how close the real Law is to the simulated values. In contrast with Model 2.4.1, instead of using the discrete version of the λ distribution, we next simulate using a uniform prior on the parameters of the Gamma distribution
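The discrete Gamma-Poisson mixture of Example 2.4.2 can be sketched the same way (the sample sizes below are ours, larger than the thesis's 250, so the comparison is stable): the mixture's first-digit frequencies land much closer to Benford's law than those of the single Poisson(100) model:

```python
import math
import random

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

def poisson(lam, rng):
    """Knuth's multiplicative method (adequate for moderate lambda)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def first_digit_freqs(values):
    pos = [v for v in values if v > 0]
    return {d: sum(1 for v in pos if first_digit(v) == d) / len(pos)
            for d in range(1, 10)}

rng = random.Random(13)
single = [poisson(100, rng) for _ in range(2000)]
mixture = [poisson(lam, rng) for lam in (10, 20, 30, 50, 70) for _ in range(400)]

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def max_deviation(freqs):
    """Largest absolute gap between observed and Benford first-digit proportions."""
    return max(abs(freqs[d] - benford[d]) for d in range(1, 10))
```

The mixture is still not a perfect Benford process (consistent with the moderate P(H0 | data) reported above), but its deviation is markedly smaller than the fixed-λ model's.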
function. The model implemented in this example goes as follows:

n ∼ Pois(λν)
λ ∼ G(α, β)
α, β ∼ Unif(1, 500)   (2.4.2)

This simulation is an extension of model 2.4.1. In general this is a negative binomial family of distributions, which is indeed a mixture of distributions itself. In Figure 3 we can see the histogram of the cumulative distribution (a) and the proportions of the significant digits together with the N-B first-digit law proportions. Here the probability of the null hypothesis given the data is 1. Table 1 shows a summary of the overall results.

Example 2.4.3. The multinomial model is a rich source of mixtures: observing an electoral process, one may see different parameters for the probability of each candidate per region in a country. As a small experiment, suppose that there are two candidates and that some of the persons in an electoral college of a particular country do not vote; then for that particular region the parameter vector is p = [p1, p2, p3] with p1 + p2 < 1 and p3 = 1 − p1 − p2. Recall that p3 is the probability that a person votes for neither of the candidates. For this particular simulation, 1000 electoral colleges are simulated in 10 regions with, as we said, two candidates. The joint density function of all the data is presented in Figure 4. Here P(H0 | data) = 1 for 29058 simulated data points. A summary of the examples is presented in Table 2.2.
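A toy version of the multinomial electoral simulation can be sketched as follows. All parameter ranges here (regional support probabilities, college sizes, counts of regions and colleges) are hypothetical, chosen only for illustration; the first digits of the simulated vote counts put far more mass on low digits than on high ones, as a Benford-like process should:

```python
import random

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

rng = random.Random(17)
digits = []
for region in range(10):
    # Hypothetical region-level support; p3 = 1 - p1 - p2 is the abstention rate.
    p1 = rng.uniform(0.2, 0.6)
    p2 = rng.uniform(0.1, 0.9 - p1)
    for college in range(50):
        n_voters = rng.randint(50, 2000)   # college sizes vary widely
        v1 = v2 = 0
        for _ in range(n_voters):          # each voter picks 1, 2, or abstains
            u = rng.random()
            if u < p1:
                v1 += 1
            elif u < p1 + p2:
                v2 += 1
        for v in (v1, v2):
            if v > 0:
                digits.append(first_digit(v))

freq = {d: digits.count(d) / len(digits) for d in range(1, 10)}
```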
Table 2.2: Summary of the results of the above examples.

Example               | Simulated length of data | P(H0 | data) | p-value
Poisson Model         | 500                      | 0            | 0
Pois-Gamma Discrete   | 250                      | 0.991        | 0.008
Neg-Binomial          | 500                      | 0.989        | 0.001
Multinomial           | 29058                    | 0.999        | 0.002

2.5 Conclusions from the Examples

Note that as the hierarchy of each model becomes more complicated, agreement with the N-B Law frequencies in the first digit is found more easily: the more complicated the model, the closer the approach to the N-B Law. Restrictions on the parameters affect the statistical closeness to the Benford Law.
[Figure 2.2: Histogram of the simplest Poisson model with λ = 100 (top), and the empirical first-digit proportions of the simulation against the Newcomb-Benford theoretical frequencies for the first significant digit (bottom).]

[Figure 2.3: Histogram of the Poisson model with the partition according to the different λ parameters (top), and the empirical first-digit proportions of the discrete Gamma-Poisson model against the Newcomb-Benford theoretical frequencies for the first significant digit (bottom). This represents the simulation results of the discrete-mixture example.]