Module 2.2: Choosing the Right Independence Test

Time: 20 min | Prerequisites: Module 2.1

What You'll Learn

  1. Why the test choice matters
  2. Decision tree for selecting the right test
  3. Practical examples of each test
  4. Speed vs. accuracy tradeoffs

Why Does the Test Matter?

The independence test determines HOW Tigramite checks whether two variables are (conditionally) dependent.

Wrong test                        Problem
Linear test on nonlinear data     Misses relationships
Continuous test on categorical    Invalid results
Complex test on simple data       Wastes time

Bottom line: Match your test to your data!
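
A quick way to see the first row of that table in action (a plain NumPy sketch, not Tigramite itself): with a symmetric nonlinear dependence, the linear correlation is close to zero, so a purely linear test reports nothing.

import numpy as np

np.random.seed(0)
x = np.random.randn(1000)
y = x**2 + 0.1 * np.random.randn(1000)  # Y is almost fully determined by X (nonlinearly)

# The linear (Pearson) correlation is near zero despite the strong dependence
print(np.corrcoef(x, y)[0, 1])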

The Decision Tree

Is your data CONTINUOUS?
│
├── YES: Are relationships LINEAR?
│   │
│   ├── YES: Is noise Gaussian?
│   │   ├── YES → ParCorr (fastest)
│   │   └── NO  → RobustParCorr
│   │
│   └── NO (nonlinear):
│       ├── Additive nonlinear? → GPDC
│       └── General nonlinear?  → CMIknn (most flexible)
│
└── NO (discrete/categorical):
    ├── Single category variable?    → Gsquared
    └── Multiple category variables? → CMIsymb

Mixed continuous + discrete? → RegressionCI or CMIknnMixed

Test 1: ParCorr (Partial Correlation)

Use when: Linear relationships, Gaussian noise

Pros: Very fast, well-understood statistics

Cons: Misses nonlinear relationships

import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr

# ParCorr is perfect for LINEAR data
np.random.seed(42)
T = 500

# Create linear relationships
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn()  # Autocorrelation
    Y[t] = 0.5 * X[t-1] + np.random.randn()  # X causes Y (linear!)

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

# Use ParCorr
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

print("ParCorr detected:")
pcmci.print_significant_links(p_matrix=results['p_matrix'],
                               val_matrix=results['val_matrix'], alpha_level=0.05)

Test 2: RobustParCorr

Use when: Linear relationships, but NON-Gaussian noise (heavy tails, skewed)

Pros: Handles heavy-tailed and skewed distributions

Cons: Slightly slower than ParCorr

from tigramite.independence_tests.robust_parcorr import RobustParCorr

# RobustParCorr handles non-Gaussian marginals
np.random.seed(42)
T = 500

# Create linear relationships with EXPONENTIAL noise (non-Gaussian!)
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.exponential(1)  # Exponential noise!
    Y[t] = 0.5 * X[t-1] + np.random.exponential(1)

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

# Use RobustParCorr
robust = RobustParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=robust, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

Test 3: CMIknn (Conditional Mutual Information)

Use when: Nonlinear relationships, continuous data

Pros: Catches ANY type of dependency

Cons: Slower, needs more data (T > 500)

from tigramite.independence_tests.cmiknn import CMIknn

# CMIknn catches NONLINEAR relationships
np.random.seed(42)
T = 1000  # Need more data for nonparametric tests

# Create NONLINEAR relationships
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn() * 0.5
    Y[t] = np.sin(X[t-1]) + np.random.randn() * 0.3  # NONLINEAR: sin(X)!

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

# Use CMIknn
cmiknn = CMIknn(significance='shuffle_test', knn=0.1)  # knn as a fraction of the sample size
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=cmiknn, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

# CMIknn will detect the nonlinear relationship that ParCorr might miss!
Important: CMIknn uses significance='shuffle_test', a permutation-based test. It is slower than analytic p-values but necessary here, because no analytic null distribution is available for the kNN-based CMI estimator.
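
If the shuffle test is too slow for your series, one common compromise is to reduce the number of permutations via sig_samples (a sketch; the value below is illustrative and the default may differ between Tigramite versions):

# Fewer permutations -> faster but noisier p-values (illustrative value)
cmiknn_fast = CMIknn(significance='shuffle_test', knn=0.1, sig_samples=200)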

Test 4: Gsquared (for Discrete Data)

Use when: Variables are categories (e.g., Low/Medium/High, On/Off)

Pros: Designed for categorical data

Cons: Only for discrete variables

from tigramite.independence_tests.gsquared import Gsquared

# Gsquared for categorical data
np.random.seed(42)
T = 1000

# Create categorical data (0, 1, 2 representing Low, Medium, High)
X = np.zeros(T, dtype=int)
Y = np.zeros(T, dtype=int)

for t in range(1, T):
    X[t] = np.random.choice([0, 1, 2])
    # Y depends on X from previous time step
    if X[t-1] == 0:
        Y[t] = np.random.choice([0, 1, 2], p=[0.7, 0.2, 0.1])
    elif X[t-1] == 1:
        Y[t] = np.random.choice([0, 1, 2], p=[0.2, 0.6, 0.2])
    else:
        Y[t] = np.random.choice([0, 1, 2], p=[0.1, 0.2, 0.7])

data = np.column_stack([X, Y]).astype(float)
dataframe = pp.DataFrame(data, var_names=['MachineState', 'Quality'])

# Use Gsquared
gsquared = Gsquared(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=gsquared, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

Quick Reference

Test           Data         Relationships          Speed
ParCorr        Continuous   Linear                 Fast
RobustParCorr  Continuous   Linear, non-Gaussian   Fast
CMIknn         Continuous   Nonlinear              Slow
GPDC           Continuous   Additive nonlinear     Medium
Gsquared       Discrete     Any                    Fast
CMIsymb        Discrete     Multivariate           Medium
RegressionCI   Mixed        Linear + discrete      Medium
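
GPDC appears in the decision tree and the table but has no worked example above. Here is a minimal sketch following the same pattern as the earlier examples; the quadratic link between X and Y is an illustrative assumption.

from tigramite.independence_tests.gpdc import GPDC

# GPDC targets ADDITIVE nonlinear relationships
np.random.seed(42)
T = 500

X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn() * 0.5
    Y[t] = 0.5 * X[t-1]**2 + np.random.randn() * 0.3  # Additive nonlinear: X^2

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

gpdc = GPDC(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=gpdc, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)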

How to Import Each Test

# Linear tests (fast)
from tigramite.independence_tests.parcorr import ParCorr
from tigramite.independence_tests.robust_parcorr import RobustParCorr

# Nonlinear tests (flexible but slower)
from tigramite.independence_tests.cmiknn import CMIknn
from tigramite.independence_tests.gpdc import GPDC

# Discrete/categorical tests
from tigramite.independence_tests.gsquared import Gsquared
from tigramite.independence_tests.cmisymb import CMIsymb

# Mixed data
from tigramite.independence_tests.regressionCI import RegressionCI

Significance Methods

The significance parameter controls how p-values are computed:

  • 'analytic' - Fast; p-values come from a known null distribution (available for ParCorr, RobustParCorr, Gsquared)
  • 'shuffle_test' - Slower; p-values come from permutation testing (required for CMIknn) - more flexible but computationally intensive
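
The same test class can usually be constructed with either setting. A minimal sketch with ParCorr (sig_samples controls the number of permutations for the shuffle test; the value is illustrative):

# Analytic p-values: fast, uses the known null distribution
parcorr_analytic = ParCorr(significance='analytic')

# Permutation p-values: slower, but does not rely on distributional assumptions
parcorr_shuffle = ParCorr(significance='shuffle_test', sig_samples=500)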

Practical Tips

  1. Start with ParCorr - It's fast and works well for many real-world datasets
  2. Check linearity first - Use tp.plot_scatterplots() to visualize pairwise relationships (see the sketch after this list)
  3. Use CMIknn when unsure - It's the most flexible (but slowest)
  4. Need more data for nonparametric tests - CMIknn typically needs T > 500
  5. Match test to data type - Gsquared for discrete, ParCorr/CMIknn for continuous
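
A minimal sketch for tip 2, assuming the dataframe from the ParCorr example above and that tigramite.plotting is available as tp:

import matplotlib.pyplot as plt
from tigramite import plotting as tp

# Pairwise scatterplots of the variables: roughly linear point clouds suggest
# ParCorr is enough; curved or more complex patterns point to GPDC or CMIknn
tp.plot_scatterplots(dataframe=dataframe)
plt.show()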