Module 2.2: Choosing the Right Independence Test

Time: 20 min | Prerequisites: Module 2.1

What You'll Learn

  1. Why the test choice matters
  2. Decision tree for selecting the right test
  3. Practical examples of each test
  4. Speed vs. accuracy tradeoffs

Why Does the Test Matter?

The independence test determines HOW Tigramite checks whether two variables are (conditionally) dependent.

Wrong test                        Problem
Linear test on nonlinear data     Misses relationships
Continuous test on categorical    Invalid results
Complex test on simple data       Wastes time

Bottom line: Match your test to your data!
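
A quick way to see the first row of that table in action (a plain NumPy sketch, not Tigramite itself): with a symmetric nonlinear dependence, the linear correlation is close to zero, so a purely linear test reports nothing.

import numpy as np

np.random.seed(0)
x = np.random.randn(1000)
y = x**2 + 0.1 * np.random.randn(1000)  # Y is almost fully determined by X (nonlinearly)

# The linear (Pearson) correlation is near zero despite the strong dependence
print(np.corrcoef(x, y)[0, 1])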

The Decision Tree

Is your data CONTINUOUS?
│
├── YES: Are relationships LINEAR?
│   │
│   ├── YES: Is noise Gaussian?
│   │   ├── YES → ParCorr (fastest)
│   │   └── NO  → RobustParCorr
│   │
│   └── NO (nonlinear):
│       ├── Additive nonlinear? → GPDC
│       └── General nonlinear?  → CMIknn (most flexible)
│
└── NO (discrete/categorical):
    ├── Single category variable?    → Gsquared
    └── Multiple category variables? → CMIsymb

Mixed continuous + discrete? → RegressionCI or CMIknnMixed

Test 1: ParCorr (Partial Correlation)

Use when: Linear relationships, Gaussian noise

Pros: Very fast, well-understood statistics

Cons: Misses nonlinear relationships

import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests.parcorr import ParCorr

# ParCorr is perfect for LINEAR data
np.random.seed(42)
T = 500

# Create linear relationships
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn()  # Autocorrelation
    Y[t] = 0.5 * X[t-1] + np.random.randn()  # X causes Y (linear!)

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

# Use ParCorr
parcorr = ParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=parcorr, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

print("ParCorr detected:")
pcmci.print_significant_links(p_matrix=results['p_matrix'],
                               val_matrix=results['val_matrix'], alpha_level=0.05)

Test 2: RobustParCorr

Use when: Linear relationships, but NON-Gaussian noise (heavy tails, skewed)

Pros: Handles heavy-tailed and skewed distributions

Cons: Slightly slower than ParCorr

from tigramite.independence_tests.robust_parcorr import RobustParCorr

# RobustParCorr handles non-Gaussian marginals
np.random.seed(42)
T = 500

# Create linear relationships with EXPONENTIAL noise (non-Gaussian!)
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.exponential(1)  # Exponential noise!
    Y[t] = 0.5 * X[t-1] + np.random.exponential(1)

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

# Use RobustParCorr
robust = RobustParCorr(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=robust, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

Test 3: CMIknn (Conditional Mutual Information)

Use when: Nonlinear relationships, continuous data

Pros: Catches ANY type of dependency

Cons: Slower, needs more data (T > 500)

from tigramite.independence_tests.cmiknn import CMIknn

# CMIknn catches NONLINEAR relationships
np.random.seed(42)
T = 1000  # Need more data for nonparametric tests

# Create NONLINEAR relationships
X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn() * 0.5
    Y[t] = np.sin(X[t-1]) + np.random.randn() * 0.3  # NONLINEAR: sin(X)!

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

# Use CMIknn
cmiknn = CMIknn(significance='shuffle_test', knn=0.1)  # knn as a fraction of the sample size
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=cmiknn, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

# CMIknn will detect the nonlinear relationship that ParCorr might miss!
Important: CMIknn uses significance='shuffle_test', a permutation-based test. It is slower than analytic p-values but necessary here, because no analytic null distribution is available for the kNN-based CMI estimator.
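
If the shuffle test is too slow for your series, one common compromise is to reduce the number of permutations via sig_samples (a sketch; the value below is illustrative and the default may differ between Tigramite versions):

# Fewer permutations -> faster but noisier p-values (illustrative value)
cmiknn_fast = CMIknn(significance='shuffle_test', knn=0.1, sig_samples=200)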

Test 4: Gsquared (for Discrete Data)

Use when: Variables are categories (e.g., Low/Medium/High, On/Off)

Pros: Designed for categorical data

Cons: Only for discrete variables

from tigramite.independence_tests.gsquared import Gsquared

# Gsquared for categorical data
np.random.seed(42)
T = 1000

# Create categorical data (0, 1, 2 representing Low, Medium, High)
X = np.zeros(T, dtype=int)
Y = np.zeros(T, dtype=int)

for t in range(1, T):
    X[t] = np.random.choice([0, 1, 2])
    # Y depends on X from previous time step
    if X[t-1] == 0:
        Y[t] = np.random.choice([0, 1, 2], p=[0.7, 0.2, 0.1])
    elif X[t-1] == 1:
        Y[t] = np.random.choice([0, 1, 2], p=[0.2, 0.6, 0.2])
    else:
        Y[t] = np.random.choice([0, 1, 2], p=[0.1, 0.2, 0.7])

data = np.column_stack([X, Y]).astype(float)
dataframe = pp.DataFrame(data, var_names=['MachineState', 'Quality'])

# Use Gsquared
gsquared = Gsquared(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=gsquared, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)

Quick Reference

Test           Data         Relationships          Speed
ParCorr        Continuous   Linear                 Fast
RobustParCorr  Continuous   Linear, non-Gaussian   Fast
CMIknn         Continuous   Nonlinear              Slow
GPDC           Continuous   Additive nonlinear     Medium
Gsquared       Discrete     Any                    Fast
CMIsymb        Discrete     Multivariate           Medium
RegressionCI   Mixed        Linear + discrete      Medium
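
GPDC appears in the decision tree and the table but has no worked example above. Here is a minimal sketch following the same pattern as the earlier examples; the quadratic link between X and Y is an illustrative assumption.

from tigramite.independence_tests.gpdc import GPDC

# GPDC targets ADDITIVE nonlinear relationships
np.random.seed(42)
T = 500

X = np.zeros(T)
Y = np.zeros(T)
for t in range(1, T):
    X[t] = 0.7 * X[t-1] + np.random.randn() * 0.5
    Y[t] = 0.5 * X[t-1]**2 + np.random.randn() * 0.3  # Additive nonlinear: X^2

data = np.column_stack([X, Y])
dataframe = pp.DataFrame(data, var_names=['X', 'Y'])

gpdc = GPDC(significance='analytic')
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=gpdc, verbosity=0)
results = pcmci.run_pcmci(tau_max=3, pc_alpha=0.05)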

How to Import Each Test

# Linear tests (fast)
from tigramite.independence_tests.parcorr import ParCorr
from tigramite.independence_tests.robust_parcorr import RobustParCorr

# Nonlinear tests (flexible but slower)
from tigramite.independence_tests.cmiknn import CMIknn
from tigramite.independence_tests.gpdc import GPDC

# Discrete/categorical tests
from tigramite.independence_tests.gsquared import Gsquared
from tigramite.independence_tests.cmisymb import CMIsymb

# Mixed data
from tigramite.independence_tests.regressionCI import RegressionCI

Significance Methods

The significance parameter controls how p-values are computed:

  • 'analytic' - Fast; p-values come from a known null distribution (available for ParCorr, RobustParCorr, Gsquared)
  • 'shuffle_test' - Slower; p-values come from permutation testing (required for CMIknn) - more flexible but computationally intensive
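
The same test class can usually be constructed with either setting. A minimal sketch with ParCorr (sig_samples controls the number of permutations for the shuffle test; the value is illustrative):

# Analytic p-values: fast, uses the known null distribution
parcorr_analytic = ParCorr(significance='analytic')

# Permutation p-values: slower, but does not rely on distributional assumptions
parcorr_shuffle = ParCorr(significance='shuffle_test', sig_samples=500)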

Practical Tips

  1. Start with ParCorr - It's fast and works well for many real-world datasets
  2. Check linearity first - Use tp.plot_scatterplots() to visualize pairwise relationships (see the sketch after this list)
  3. Use CMIknn when unsure - It's the most flexible (but slowest)
  4. Need more data for nonparametric tests - CMIknn typically needs T > 500
  5. Match test to data type - Gsquared for discrete, ParCorr/CMIknn for continuous
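
A minimal sketch for tip 2, assuming the dataframe from the ParCorr example above and that tigramite.plotting is available as tp:

import matplotlib.pyplot as plt
from tigramite import plotting as tp

# Pairwise scatterplots of the variables: roughly linear point clouds suggest
# ParCorr is enough; curved or more complex patterns point to GPDC or CMIknn
tp.plot_scatterplots(dataframe=dataframe)
plt.show()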