JOSS Paper for PyFixest #885

311 changes: 311 additions & 0 deletions joss_paper/paper.bib
@@ -0,0 +1,311 @@
@techreport{berge2018efficient,
title={Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm},
author={Berg{\'e}, Laurent},
year={2018},
institution={Department of Economics at the University of Luxembourg}
}

@article{seabold2010statsmodels,
title={Statsmodels: Econometric and statistical modeling with {P}ython},
author={Seabold, Skipper and Perktold, Josef},
journal={SciPy},
volume={7},
number={1},
pages={92--96},
year={2010}
}

@article{gaure2013lfe,
title={lfe: Linear group fixed effects},
author={Gaure, Simen},
year={2013}
}

@article{harris2020array,
title={Array programming with NumPy},
author={Harris, Charles R and Millman, K Jarrod and Van Der Walt, St{\'e}fan J and Gommers, Ralf and Virtanen, Pauli and Cournapeau, David and Wieser, Eric and Taylor, Julian and Berg, Sebastian and Smith, Nathaniel J and others},
journal={Nature},
volume={585},
number={7825},
pages={357--362},
year={2020},
publisher={Nature Publishing Group UK London}
}

@article{virtanen2020scipy,
title={SciPy 1.0: fundamental algorithms for scientific computing in Python},
author={Virtanen, Pauli and Gommers, Ralf and Oliphant, Travis E and Haberland, Matt and Reddy, Tyler and Cournapeau, David and Burovski, Evgeni and Peterson, Pearu and Weckesser, Warren and Bright, Jonathan and others},
journal={Nature Methods},
volume={17},
number={3},
pages={261--272},
year={2020},
publisher={Nature Publishing Group US New York}
}

@inproceedings{lam2015numba,
title={Numba: A {LLVM}-based {P}ython {JIT} compiler},
author={Lam, Siu Kwan and Pitrou, Antoine and Seibert, Stanley},
booktitle={Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC},
pages={1--6},
year={2015}
}

@software{wardrop_formulaic_2024,
author = {Matthew Wardrop},
title = {Formulaic: A high-performance implementation of Wilkinson formulas for Python},
year = {2024},
version = {1.1.1},
url = {https://github.com/matthewwardrop/formulaic},
note = {Accessed: 2025-04-27}
}


@techreport{dube2023local,
title={A local projections approach to difference-in-differences event studies},
author={Dube, Arindrajit and Girardi, Daniele and Jord{\`a}, {\`O}scar and Taylor, Alan M},
year={2023},
institution={National Bureau of Economic Research}
}

@article{roodman2019fast,
title={Fast and wild: Bootstrap inference in Stata using boottest},
author={Roodman, David and Nielsen, Morten {\O}rregaard and MacKinnon, James G and Webb, Matthew D},
journal={The Stata Journal},
volume={19},
number={1},
pages={4--60},
year={2019},
publisher={SAGE Publications Sage CA: Los Angeles, CA}
}

@article{fischer2022fwildclusterboot,
title={fwildclusterboot: Fast wild cluster bootstrap inference for linear regression models (version 0.12.3)},
author={Fischer, Alexander and Roodman, David},
journal={URL https://cran.r-project.org/package=fwildclusterboot},
year={2022}
}

@article{hess2017randomization,
title={Randomization inference with Stata: A guide and software},
author={He{\ss}, Simon},
journal={The Stata Journal},
volume={17},
number={3},
pages={630--651},
year={2017},
publisher={SAGE Publications Sage CA: Los Angeles, CA}
}

@article{gardner2022two,
title={Two-stage differences in differences},
author={Gardner, John},
journal={arXiv preprint arXiv:2207.05943},
year={2022}
}

@article{butts2021did2s,
title={$\{$did2s$\}$: Two-stage difference-in-differences},
author={Butts, Kyle and Gardner, John},
journal={arXiv preprint arXiv:2109.05913},
year={2021}
}

@article{abadie2023should,
title={When should you adjust standard errors for clustering?},
author={Abadie, Alberto and Athey, Susan and Imbens, Guido W and Wooldridge, Jeffrey M},
journal={The Quarterly Journal of Economics},
volume={138},
number={1},
pages={1--35},
year={2023},
publisher={Oxford University Press}
}

@article{clarke2020romano,
title={The Romano--Wolf multiple-hypothesis correction in Stata},
author={Clarke, Damian and Romano, Joseph P and Wolf, Michael},
journal={The Stata Journal},
volume={20},
number={4},
pages={812--843},
year={2020},
publisher={SAGE Publications Sage CA: Los Angeles, CA}
}

@article{romano2005exact,
title={Exact and approximate stepdown methods for multiple hypothesis testing},
author={Romano, Joseph P and Wolf, Michael},
journal={Journal of the American Statistical Association},
volume={100},
number={469},
pages={94--108},
year={2005},
publisher={Taylor \& Francis}
}

@article{mackinnon2023fast,
title={Fast and reliable jackknife and bootstrap methods for cluster-robust inference},
author={MacKinnon, James G and Nielsen, Morten {\O}rregaard and Webb, Matthew D},
journal={Journal of Applied Econometrics},
volume={38},
number={5},
pages={671--694},
year={2023},
publisher={Wiley Online Library}
}

@article{busch2023lpdid,
title={Lpdid: Stata module implementing local projections difference-in-differences (lp-did) estimator},
author={Busch, Alexander and Girardi, Daniele},
year={2023},
publisher={Boston College Department of Economics}
}

@article{correia2023reghdfe,
title={REGHDFE: Stata module to perform linear or instrumental-variable regression absorbing any number of high-dimensional fixed effects},
author={Correia, Sergio},
year={2023},
publisher={Boston College Department of Economics}
}

@article{gaure2013ols,
title={OLS with multiple high dimensional category variables},
author={Gaure, Simen},
journal={Computational Statistics \& Data Analysis},
volume={66},
pages={8--18},
year={2013},
publisher={Elsevier}
}

@article{montiel2019simultaneous,
title={Simultaneous confidence bands: Theory, implementation, and an application to SVARs},
author={Montiel Olea, Jos{\'e} Luis and Plagborg-M{\o}ller, Mikkel},
journal={Journal of Applied Econometrics},
volume={34},
number={1},
pages={1--17},
year={2019},
publisher={Wiley Online Library}
}

@article{correia2020fast,
title={Fast Poisson estimation with high-dimensional fixed effects},
author={Correia, Sergio and Guimar{\~a}es, Paulo and Zylkin, Tom},
journal={The Stata Journal},
volume={20},
number={1},
pages={95--115},
year={2020},
publisher={SAGE Publications Sage CA: Los Angeles, CA}
}

@article{gautier2008rpy2,
title={rpy2: A Simple and Efficient Access to R from Python},
author={Gautier, L},
journal={URL http://rpy.sourceforge.net/rpy2.html},
volume={3},
number={1},
pages={20},
year={2008}
}

@article{lal2023much,
title={How much should we trust instrumental variable estimates in political science? Practical advice based on over 60 replicated studies},
author={Lal, Apoorva and Lockhart, Mac and Xu, Yiqing and Zu, Ziwen},
journal={arXiv preprint arXiv:2303.11399},
year={2023}
}

@article{sun2021estimating,
title={Estimating dynamic treatment effects in event studies with heterogeneous treatment effects},
author={Sun, Liyang and Abraham, Sarah},
journal={Journal of Econometrics},
volume={225},
number={2},
pages={175--199},
year={2021},
publisher={Elsevier}
}

@article{stammann2017fast,
title={Fast and feasible estimation of generalized linear models with high-dimensional k-way fixed effects},
author={Stammann, Amrei},
journal={arXiv preprint arXiv:1707.01815},
year={2017}
}

@article{wong2021you,
title={You Only Compress Once: Optimal Data Compression for Estimating Linear Models},
author={Wong, Jeffrey and Forsell, Eskil and Lewis, Randall and Mao, Tobias and Wardrop, Matthew},
journal={arXiv preprint arXiv:2102.11297},
year={2021}
}

@article{lal2024large,
title={Large Scale Longitudinal Experiments: Estimation and Inference},
author={Lal, Apoorva and Fischer, Alexander and Wardrop, Matthew},
journal={arXiv preprint arXiv:2410.09952},
year={2024}
}

@article{lal2025can,
title={When can we get away with using the two-way fixed effects regression?},
author={Lal, Apoorva},
journal={arXiv preprint arXiv:2503.05125},
year={2025}
}

@article{gelbach2016covariates,
title={When do covariates matter? And which ones, and how much?},
author={Gelbach, Jonah B},
journal={Journal of Labor Economics},
volume={34},
number={2},
pages={509--543},
year={2016},
publisher={University of Chicago Press Chicago, IL}
}

@misc{linearmodels,
author = {Kevin Sheppard},
title = {linearmodels: Linear Panel, Instrumental Variable, Asset Pricing, and Other Models},
year = {2024},
howpublished = {\url{https://github.com/bashtage/linearmodels}},
note = {Accessed: 2025-04-28}
}

@misc{stammann2020package,
title={Package `alpaca'},
author={Stammann, Amrei and Czarnowske, Daniel},
year={2020}
}

@software{fixedeffectmodelsjl_2025,
author = {{FixedEffects}},
title = {FixedEffectModels.jl: Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables},
year = {2025},
version = {1.12.0},
url = {https://github.com/FixedEffects/FixedEffectModels.jl},
note = {Accessed: 2025-04-28}
}

@software{gortmaker_pyhdfe_2023,
author = {Jeff Gortmaker and Anya Tarascina},
title = {pyhdfe: High dimensional fixed effect absorption with Python 3},
year = {2023},
version = {0.2.0},
url = {https://github.com/jeffgortmaker/pyhdfe},
note = {Accessed: 2025-04-28}
}
65 changes: 65 additions & 0 deletions joss_paper/pyfixest_joss.md
@@ -0,0 +1,65 @@
---
title: 'PyFixest: A Python Port of fixest for High-Dimensional Fixed-Effects Regression'
date: 27 April 2025
author:
- name: Alexander Fischer
affiliation: Trivago
---


## Summary

PyFixest is an open-source Python library that implements efficient routines for regression analysis with multiple, potentially high-dimensional fixed effects by applying the Frisch-Waugh-Lovell theorem. It is a faithful port of the R package fixest [@berge2018efficient], aiming to replicate fixest’s core design principles and functionality within the Python ecosystem. Users familiar with fixest in R can seamlessly transition their analysis to Python, using the same syntax and obtaining identical results. Likewise, users of PyFixest should find it easy to port their analysis to R and use the fixest package.

## Statement of Need

Fixed-effects models with high-dimensional categorical effects are ubiquitous in the social sciences, where researchers often need to control for multiple levels of unobserved heterogeneity. To estimate such models efficiently, researchers typically rely on specialized software that handles the computational challenges of high-dimensional fixed effects by applying the Frisch-Waugh-Lovell (FWL) theorem (@correia2023reghdfe and @correia2020fast in Stata; @gaure2013lfe, @berge2018efficient, and @stammann2020package in R; @fixedeffectmodelsjl_2025 in Julia). In practice, the FWL theorem is implemented via an iterative "within-transformation" (demeaning) approach. This avoids creating large matrices of dummy variables, which is computationally expensive and memory-intensive, especially with large datasets or many fixed effects [see @gaure2013ols for details].
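The within-transformation can be illustrated with a minimal sketch in pure Python. It assumes a simulated, unbalanced toy panel with one regressor and two fixed effects; the function name `demean`, the simulated data, and all parameter values are hypothetical and chosen for illustration only. This is not the algorithm implemented in *PyFixest* or *fixest*, merely the textbook alternating-projections idea: group means are swept out one fixed effect at a time until nothing is left to remove.

```python
import random
from collections import defaultdict


def demean(values, groups_list, tol=1e-10, max_iter=1000):
    """Sweep out group means for each fixed effect in turn until convergence."""
    x = list(values)
    for _ in range(max_iter):
        largest_shift = 0.0
        for groups in groups_list:
            total = defaultdict(float)
            count = defaultdict(int)
            for xi, g in zip(x, groups):
                total[g] += xi
                count[g] += 1
            mean = {g: total[g] / count[g] for g in total}
            x = [xi - mean[g] for xi, g in zip(x, groups)]
            largest_shift = max(largest_shift, max(abs(m) for m in mean.values()))
        if largest_shift < tol:  # all remaining group means are numerically zero
            break
    return x


# Toy unbalanced panel: y = 2 * x + unit effect + year effect, without noise.
rng = random.Random(0)
unit, year, x, y = [], [], [], []
for u in range(30):
    for t in range(8):
        if rng.random() < 0.2:  # drop roughly 20% of cells
            continue
        xi = rng.gauss(0.0, 1.0) + 0.3 * u  # regressor correlated with the unit effect
        unit.append(u)
        year.append(t)
        x.append(xi)
        y.append(2.0 * xi + 1.5 * u - 0.7 * t)

x_tilde = demean(x, [unit, year])
y_tilde = demean(y, [unit, year])

# FWL: OLS on the demeaned data yields the slope of the dummy-variable regression.
beta_hat = sum(a * b for a, b in zip(x_tilde, y_tilde)) / sum(a * a for a in x_tilde)
print(round(beta_hat, 6))
```

Because the simulated outcome is an exact linear function of the regressor and the two sets of fixed effects, the slope on the demeaned data should match the true coefficient of 2 up to numerical tolerance, and no dummy matrix is ever built.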

The Python ecosystem lacks an optimized library for this purpose, and *PyFixest* aims to fill this gap. *statsmodels*, the dominant statistical regression library in Python [@seabold2010statsmodels], does not provide OLS and GLM functionality that natively handles multiple group fixed effects efficiently via the FWL theorem. Instead, users must either inefficiently one-hot encode all categorical features into large dummy matrices, or rely on a workaround that demeans the design matrix and outcome variable via the *pyhdfe* library (@gortmaker_pyhdfe_2023) before passing the transformed data to *statsmodels*' OLS class for fitting. GLM estimation is generally not supported by this two-step procedure, because the iteratively reweighted least squares (IRLS) approach used to fit such models requires that the design matrix and outcome variable be demeaned in every iteration (@correia2020fast, @stammann2017fast). Another package, *linearmodels*, provides a class for estimating linear models with multiple fixed effects, *AbsorbingLS*, which calls *pyhdfe* to perform the demeaning step.
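To see why a one-off demeaning step does not carry over to GLMs, consider the following sketch of a Poisson regression with a single fixed effect, fitted by iteratively reweighted least squares. The weights change in every iteration, so the working response and the regressor have to be demeaned again each time with the current weights. The helper names (`weighted_demean`, `poisson_draw`), the simulated data, and the starting values are assumptions made for illustration; this is not the code used by *PyFixest* or *ppmlhdfe*. With only one fixed effect, a single weighted demeaning pass per iteration is exact; with several fixed effects the demeaning itself would have to be iterated.

```python
import math
import random
from collections import defaultdict


def weighted_demean(values, groups, weights):
    """Subtract the weighted group mean from every observation."""
    num = defaultdict(float)
    den = defaultdict(float)
    for v, g, w in zip(values, groups, weights):
        num[g] += w * v
        den[g] += w
    return [v - num[g] / den[g] for v, g in zip(values, groups)]


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


# Simulated data: log E[y] = 0.2 * group + 0.5 * x.
rng = random.Random(42)
group, x, y = [], [], []
for g in range(10):
    for _ in range(30):
        xi = rng.uniform(-1.0, 1.0)
        group.append(g)
        x.append(xi)
        y.append(poisson_draw(rng, math.exp(0.2 * g + 0.5 * xi)))

# Starting values for the conditional mean and the linear predictor.
y_bar = sum(y) / len(y)
mu = [(yi + y_bar) / 2.0 for yi in y]
eta = [math.log(m) for m in mu]

beta = 0.0
for _ in range(100):
    # Working response; the IRLS weights for the Poisson log link are mu.
    z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
    # The weights changed, so both variables must be demeaned again.
    z_t = weighted_demean(z, group, mu)
    x_t = weighted_demean(x, group, mu)
    beta_new = (sum(m * a * b for m, a, b in zip(mu, x_t, z_t))
                / sum(m * a * a for m, a in zip(mu, x_t)))
    # Fitted linear predictor = working response minus the weighted residual.
    eta = [zi - (zt - beta_new * xt) for zi, zt, xt in zip(z, z_t, x_t)]
    mu = [math.exp(e) for e in eta]
    converged = abs(beta_new - beta) < 1e-12
    beta = beta_new
    if converged:
        break

# At the optimum the score with respect to beta is numerically zero.
score = sum(xi * (yi - m) for xi, yi, m in zip(x, y, mu))
print(beta, score)
```

Demeaning `x` and `y` once and then calling a generic GLM routine would hold the weights fixed at their initial values, which is why the two-step workaround described above is limited to linear models.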

By contrast, the R community has benefited from specialized tools like *lfe* [@gaure2013lfe] and *fixest* [@berge2018efficient] for high-dimensional fixed effects regression. In particular, *fixest* has introduced extremely performant and user-friendly regression software to handle multiple (potentially high-dimensional) fixed effects. Relative to *pyhdfe*'s algorithms, which are all implemented via *numpy*, the demeaning algorithm of *fixest* is orders of magnitude faster. Beyond computational efficiency, *fixest* has introduced a rich set of features for post-estimation analysis, including methods to easily summarize and plot regression results.

*PyFixest* aims to faithfully implement *fixest*'s core functionality (efficient routines for OLS, IV, and Poisson regression with fixed effects), its syntax, and its post-estimation functionality. Identical input arguments to the main estimation functions that both packages share (*feols*, *fepois*, and *feglm*) should produce identical results in R and Python. All of *fixest*'s core defaults, including the choice of variance-covariance matrices, small-sample corrections, the handling of singleton fixed effects, and the treatment of multicollinear variables, are preserved by *PyFixest*. To ensure identical behavior, both libraries are thoroughly tested against each other using rpy2 (@gautier2008rpy2).

In addition to supporting fixed-effects and instrumental variable estimation, *PyFixest* provides support for regression weights, fast Poisson regression with fixed-effect demeaning (@correia2020fast), and a range of modern event study estimators, including the local projections approach (@dube2023local, @busch2023lpdid), the two-stage difference-in-differences imputation estimator (@gardner2022two, @butts2021did2s), and the fully saturated event study estimator proposed by @sun2021estimating. The package provides comprehensive options for non-standard inference, including cluster-robust variance estimators (CRV1 and CRV3, see @mackinnon2023fast), the wild cluster bootstrap (@roodman2019fast, @fischer2022fwildclusterboot), randomization inference (@hess2017randomization), and the causal cluster variance estimator (@abadie2023should). *PyFixest* also controls the family-wise error rate via the Romano-Wolf correction (@clarke2020romano and @romano2005exact), and enables users to compute simultaneous confidence bands through a multiplier bootstrap approach (@montiel2019simultaneous). Additionally, *PyFixest* provides support for Gelbach's regression decomposition (@gelbach2016covariates), Lal's event study specification test (@lal2025can), and estimation strategies based on compression and sufficient statistics as described in @wong2021you and @lal2024large. *PyFixest* also provides multiple diagnostics for weak instruments (@lal2023much). To the best of our knowledge, most of these methods are only implemented in *PyFixest* within the Python ecosystem.
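The idea behind estimation on compressed data can be sketched in a few lines. The example below assumes a single discrete regressor and collapses 10,000 simulated observations into one row per distinct regressor value, keeping only the group count and the group mean of the outcome. The function `ols_slope_intercept` and the simulated data are hypothetical and serve only as an illustration of the sufficient-statistics argument, not as a description of *PyFixest*'s implementation. Recovering standard errors would additionally require the within-group sums of squares, which are omitted here.

```python
import random
from collections import defaultdict


def ols_slope_intercept(x, y, w=None):
    """Weighted simple regression of y on x; w defaults to unit weights."""
    w = w or [1.0] * len(x)
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Simulated data with a discrete regressor that takes four values.
rng = random.Random(1)
x = [rng.choice([0, 1, 2, 3]) for _ in range(10_000)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Compress: one row per distinct value of x with its count and mean outcome.
count = defaultdict(int)
total = defaultdict(float)
for xi, yi in zip(x, y):
    count[xi] += 1
    total[xi] += yi
levels = sorted(count)
x_c = [float(level) for level in levels]
y_c = [total[level] / count[level] for level in levels]
n_c = [float(count[level]) for level in levels]

full = ols_slope_intercept(x, y)               # regression on all rows
compressed = ols_slope_intercept(x_c, y_c, n_c)  # frequency-weighted regression
print(full, compressed)
```

The frequency-weighted regression on the compressed rows reproduces the point estimates of the full regression up to floating-point error, while operating on a data set whose size depends on the number of distinct regressor values rather than the number of observations.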

Below, we provide a brief code example that showcases that PyFixest's and fixest's syntax is nearly identical, and that both packages produce identical results.

## A (brief) Code Example


```python
import pyfixest as pf
from pyfixest.utils.dgps import get_sharkfin

df = get_sharkfin()
df.head()
```

We can fit a simple regression model via the *pf.feols()* function:

```python
# Two-way fixed effects regression with heteroskedasticity-robust standard errors
fit = pf.feols(
    fml="Y ~ treat | unit + year",
    data=df,
    vcov="hetero",
)
pf.etable(fit, digits=4)
```

In R and via *fixest*, we could have done the same with (almost) identical syntax:

```r
library(reticulate)
library(fixest)
df = py$df

fit = feols(
  fml = Y ~ treat | unit + year,
  data = df,
  vcov = "hetero"
)
etable(fit)
```
*PyFixest* and *fixest* produce identical point estimates, standard errors, and $R^2$ values.