# Correlations

Standard tools such as the autocorrelation function are often not defined for categorical data and therefore cannot be used to study it. The following functions provide ways to study categorical serial dependence.

Most of these methods are described in C. Weiss's book "An Introduction to Discrete-Valued Time Series" (2018).

## Main functions

**cramer_coefficient — Function**

```
cramer_coefficient(series, lags)
```

Measures the average association between elements of `series` at time t and time t + `lags`. Cramér's V is an unsigned measure: its values lie in [0,1], with 0 indicating perfect independence and 1 perfect dependence. The estimate can be biased; for more information, refer to [1].

Parameters:

- `series` (Array{Any,1}): 1-D Array containing the input categorical time series.
- `lags` (Array{Int,1}): lag values at which Cramér's coefficient is computed. Alternatively, `lags` can be a single integer, in which case a single value is returned.

Returns: `V`, the value of Cramér's coefficient for each value in `lags`.
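For orientation, the lag-dependent Cramér's V is usually defined as below. This is a sketch of the textbook definition, not necessarily the exact estimator the package implements. The symbols are assumptions introduced here: $h$ is the lag, $m$ the number of categories, $p_i$ the marginal probability of category $i$, and $p_{ij}(h)$ the probability of observing category $i$ at time $t$ and category $j$ at time $t + h$.

```latex
\[
  v(h) = \sqrt{\frac{1}{m-1} \sum_{i=1}^{m} \sum_{j=1}^{m}
         \frac{\bigl(p_{ij}(h) - p_i\,p_j\bigr)^2}{p_i\,p_j}}
\]
```

Under independence $p_{ij}(h) = p_i p_j$ and every term vanishes, giving $v(h) = 0$; under perfect dependence ($p_{ij}(h) = p_i$ for $i = j$ and 0 otherwise) the double sum equals $m - 1$, giving $v(h) = 1$.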

**cohen_coefficient — Function**

```
cohen_coefficient(series, lags)
```

Measures the average association between elements of `series` at time t and time t + `lags`. Cohen's k is a signed measure: its values lie in [-pe/(1 - pe), 1], where pe is the probability of agreement by chance. Positive (negative) values indicate positive (negative) serial dependence at `lags`.

Parameters:

- `series` (Array{Any,1}): 1-D Array containing the input categorical time series.
- `lags` (Array{Int,1}): lag values at which Cohen's coefficient is computed. Alternatively, `lags` can be a single integer, in which case a single value is returned.

Returns: `K`, the value of Cohen's coefficient for each value in `lags`.
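As a reference point, the lag-dependent Cohen's k is commonly written as follows. This is a sketch of the standard definition; the package's estimator may include a bias correction that is not shown here. The symbols are assumptions introduced here: $h$ is the lag, $p_j$ the marginal probability of category $j$, and $p_{jj}(h)$ the probability of observing category $j$ at both time $t$ and time $t + h$.

```latex
\[
  \kappa(h) = \frac{\sum_{j} p_{jj}(h) - p_e}{1 - p_e},
  \qquad
  p_e = \sum_{j} p_j^{2}
\]
```

When two observations separated by $h$ never agree, the first sum is 0 and $\kappa(h)$ reaches its lower bound $-p_e/(1 - p_e)$; when they always agree, the sum is 1 and $\kappa(h) = 1$.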

**theils_u — Function**

```
theils_u(series, lags)
```

Measures the average portion of information known about `series` at time t + `lags` given that `series` is known at time t. Theil's U makes use of concepts borrowed from *information theory*. U is an unsigned measure: its values lie in [0,1], with 0 meaning no shared information and 1 complete knowledge (determinism).

Parameters:

- `series` (Array{Any,1}): 1-D Array containing the input categorical time series.
- `lags` (Array{Int,1}): lag values at which Theil's U is computed. Alternatively, `lags` can be a single integer, in which case a single value is returned.

Returns: `U`, the value of Theil's U for each value in `lags`.
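For reference, Theil's U (the uncertainty coefficient) at a given lag is usually defined as the ratio of the mutual information between the two time points to the entropy of the marginal distribution. This is a sketch of the textbook form, not necessarily the exact estimator used by the package. The symbols are assumptions introduced here: $h$ is the lag, $p_i$ the marginal probability of category $i$, $p_{ij}(h)$ the probability of observing category $i$ at time $t$ and category $j$ at time $t + h$, with the convention $0 \ln 0 = 0$.

```latex
\[
  u(h) = - \frac{\sum_{i,j} p_{ij}(h)\,
               \ln\!\left(\dfrac{p_{ij}(h)}{p_i\,p_j}\right)}
              {\sum_{i} p_i \ln p_i}
\]
```

The numerator is the negated mutual information and the denominator the negated entropy, so the ratio is non-negative. It equals 0 under independence and 1 when the value at time $t$ fully determines the value at time $t + h$.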

## Confidence interval

Depending on the length of the time series and the method used, the estimated serial dependence may fluctuate considerably around its true value. It is therefore useful to compare estimates with a corresponding confidence interval to judge how significant the results are. The following function provides a confidence interval via bootstrap:

**bootstrap_CI — Function**

```
bootstrap_CI(series, lags, coef_func, n_iter = 1000, interval_size = 0.95)
```

Returns the top and bottom limits of the confidence interval at the values of `lags`. The width of the confidence interval can be chosen (defaults to 95%). The returned confidence interval corresponds to the null hypothesis (no serial dependence): if the estimated serial dependence lies within this interval, no significant correlation can be claimed.

Parameters:

- `series` (Array{Any,1}): 1-D Array containing the input categorical time series.
- `lags` (Array{Int,1}): lag values at which the CI is computed.
- `coef_func` (function): the function for which the CI needs to be computed. `coef_func` can be one of the following functions: `cramer_coefficient`, `cohen_coefficient` or `theils_u`.
- `n_iter` (Int): number of iterations for the bootstrap procedure. Higher values are more precise but more computationally demanding. Defaults to 1000.
- `interval_size` (Float): desired size of the confidence interval. Defaults to 0.95, for a 95% confidence interval.

Returns: `(top_values, bottom_values)`, the top and bottom limits of the confidence interval for each point in `lags`.
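To illustrate the idea of a null-hypothesis interval, here is a minimal sketch of a percentile bootstrap. It is not the package's implementation: `kappa_sketch` and `null_interval_sketch` are hypothetical helpers written for this example, and resampling with replacement is an assumed way of destroying serial order.

```julia
using Random, Statistics

# Hypothetical stand-in for a coefficient function: plain plug-in Cohen's kappa at one lag.
function kappa_sketch(series::AbstractVector, lag::Integer)
    n = length(series)
    # Chance agreement pe: sum of squared marginal frequencies.
    pe = sum(abs2, [count(==(c), series) / n for c in unique(series)])
    # Observed agreement between x[t] and x[t + lag].
    po = count(t -> series[t] == series[t + lag], 1:(n - lag)) / (n - lag)
    return (po - pe) / (1 - pe)
end

# Hypothetical helper: percentile interval of `coef` under resampling that ignores time order.
function null_interval_sketch(series::AbstractVector, lag::Integer, coef;
                              n_iter::Integer = 1000, interval_size::Real = 0.95,
                              rng::AbstractRNG = Random.default_rng())
    n = length(series)
    # Each replicate draws n elements with replacement, so serial dependence is destroyed.
    stats = [coef(series[rand(rng, 1:n, n)], lag) for _ in 1:n_iter]
    alpha = (1 - interval_size) / 2
    return quantile(stats, 1 - alpha), quantile(stats, alpha)  # (top, bottom)
end

series = repeat(["a", "b", "c"], 50)   # strongly periodic toy series
top, bottom = null_interval_sketch(series, 3, kappa_sketch; rng = MersenneTwister(1))
# Compare the lag-3 estimate with the upper limit of the null interval.
println(kappa_sketch(series, 3) > top)
```

Because the toy series repeats every three steps, its lag-3 estimate sits far above the interval obtained from order-free resamples, which is the situation in which a significant serial dependence can be claimed.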

## Example

Using the Pewee birdsong data (1943), one can produce a serial dependence plot using Cohen's coefficient as follows:

```
using DelimitedFiles, Plots
using CategoricalTimeSeries
# Read the 'pewee' time series from the package's test folder.
data_path = joinpath(dirname(dirname(pathof(CategoricalTimeSeries))), "test", "pewee.txt")
series = readdlm(data_path, ',')[1, :]
lags = collect(1:25)
# Cohen's coefficient at each lag, with the bootstrap limits of the 95% CI.
v = cohen_coefficient(series, lags)
t, b = bootstrap_CI(series, lags, cohen_coefficient)
a = plot(lags, v, xlabel = "Lags", ylabel = "K", label = "Cohen's k")
plot!(a, lags, t, color = "red", label = "Limits of 95% CI")
plot!(a, lags, b, color = "red", label = "", dpi = 600)
```