KAP - Capture URL

Colin Carroll blog

A summary of Colin Carroll's website highlighting his work in software engineering, machine learning, and Bayesian statistics. The site includes a blog, talks, and research projects.

Colin Carroll's website presents a portfolio of his work in software engineering and mathematics, with a strong focus on machine learning and Bayesian statistics. The content is structured into a blog…

bayesian blog github code skim

added 2026-01-13 understand mcmc

Introduction to Causal Inference with PPLs

Dr. Juan Camilo Orduz's article explores causal inference using Probabilistic Programming Languages (PPLs) like PyMC, offering a framework to address the limitations of traditional statistical methods by enabling counterfactual reasoning. The article demonstrates these concepts using the Lalonde dataset and compares OLS and GLM models for estimating the Average Treatment Effect (ATE).

The article "Introduction to Causal Inference with PPLs" by Dr. Juan Camilo Orduz explains how Probabilistic Programming Languages (PPLs) like PyMC offer a powerful framework for causal inference, addressing the limitations of traditional statistical methods that often confuse correlation with causation. Causal inference seeks to answer the question, "What would have happened if things were different?" PPLs provide several advantages for causal inference: they allow for the natural expression of causal models, rigorous quantification of uncertainty through full posterior distributions, and direct implementation of Pearl's `do` operator for counterfactual questions. They also offer flexible modeling without sacrificing interpretability. The article uses the Lalonde dataset, which investigates the effect of a job training program on earnings, to demonstrate building causal models with PPLs. It guides users through building models that account for confounders, estimating the Average Treatment Effect (ATE) using Bayesian inference, comparing naive versus adjusted estimates, and validating models. The approach taken in the notebook reproduces and extends ChiRho's backdoor adjustment tutorial, implementing the causal modeling strategy using PyMC. The article details steps for data preprocessing, establishing a Causal Directed Acyclic Graph (DAG), exploratory data analysis, and scaling numerical features. The article then delves into specifying both a linear model (OLS) and a Generalized Linear Model (GLM) in PyMC for earnings prediction, including prior predictive checks, model fitting, and diagnostics. It explains ATE estimation using both the coefficient from the OLS model and the more general `do` operator, demonstrating that both methods yield consistent results.
Finally, the notebook compares the OLS and GLM models, highlighting that the GLM, which ensures non-negative earnings through a Gamma likelihood and softplus link function, provides a better fit and tighter credible intervals for the ATE estimate. The conclusion emphasizes that while traditional OLS methods suffice for basic problems, PPLs offer superior flexibility for complex models, incorporation of prior knowledge, and model calibration, especially in scenarios with non-linear relationships or unobserved confounders.
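The naive-versus-adjusted comparison described in this summary can be illustrated with a minimal sketch. It is not the article's PyMC model and does not use the Lalonde data: it assumes a simulated dataset with a single binary confounder `z`, and the function names (`simulate`, `naive_ate`, `backdoor_ate`) are hypothetical. The adjustment is done by plain stratification rather than Bayesian inference.

```python
import random


def simulate(n=20000, seed=0):
    """Simulate toy rows (z, t, y) with one binary confounder z.

    z raises both the chance of treatment t and the outcome y, so a naive
    treated-vs-untreated comparison overstates the effect of t.
    The true effect of t on y is 2.0 by construction (an assumption of
    this toy example, not a figure from the article).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0
        y = 2.0 * t + 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_ate(rows):
    """Difference in mean outcome between treated and untreated units."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def backdoor_ate(rows):
    """Backdoor adjustment: sum over z of P(z) * (E[y|t=1,z] - E[y|t=0,z])."""
    ate = 0.0
    for z_value in (0, 1):
        stratum = [(t, y) for z, t, y in rows if z == z_value]
        treated = [y for t, y in stratum if t == 1]
        control = [y for t, y in stratum if t == 0]
        ate += len(stratum) / len(rows) * (mean(treated) - mean(control))
    return ate


rows = simulate()
print(f"naive ATE:    {naive_ate(rows):.2f}")
print(f"adjusted ATE: {backdoor_ate(rows):.2f}")
```

In this toy setup the naive estimate lands near 3.8 in expectation because treated units are mostly `z = 1`, while the stratified estimate recovers a value close to the simulated effect of 2.0.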

article bayesian github code summarize

added 2026-01-12 learn from this

Real vs Synthetic Consumers

The tool was unable to access the content of the provided URL, preventing the generation of a summary. Possible reasons include paywalls, login requirements, or website unavailability.

The attempt to browse the provided URL failed, preventing the extraction of content. Consequently, a summary or synopsis cannot be generated. The failure could stem from various access restrictions, s…

article do summarize

added 2026-01-11 proposal for Cogent

DuckDB for Data Engineers

Learn to build hybrid data workflows using DuckDB and MotherDuck, enabling local execution and cloud scalability without changing tools. The course covers setting up DuckDB, querying data, transforming data with Python, and optimizing costs.

The course "DuckDB for Data Engineers: From Local to Cloud with MotherDuck" provides practical instruction on leveraging DuckDB and MotherDuck for building hybrid data workflows. DuckDB, a lightweight…

course do

added 2026-01-11 better duckdb

Predict Horse Races with BigQuery ML

This tutorial offers a quickstart guide to BigQuery ML, demonstrating how to build predictive models using SQL and historical horse racing data.

This Fireship.io tutorial demonstrates how to use BigQuery ML to build predictive models without extensive data science expertise. The core example focuses on predicting horse racing outcomes using hi…

article video code publish

added 2026-01-11 modernize this ml use case

You Don't Need MLOps

Valliappa (Lak) Lakshmanan talks MLOps

The speaker challenges the common perception that MLOps is universally essential for machine learning deployments. He explains MLOps as an extension of DevOps, aiming to enable operations teams to administer ML models, a need highlighted by the "hidden technical debt" in ML systems (Sculley et al., 2015). This technical debt arises because while ML model development is rapid, long-term maintenance is complex, especially when dealing with data drift, concept drift, and ensuring consistency between training and serving environments. However, the speaker contends that the industry's current emphasis on formalizing and automating *every* step of the ML workflow (feature engineering, data validation, deployment, etc.) leads to "automation for automation's sake." This over-automation, as depicted in cloud provider architectures, creates rigid, overly complex systems that demand significant effort to maintain, diverting resources from core ML problem-solving. The speaker argues for a "keep it simple" philosophy: the primary goal is for ops personnel to maintain models, which doesn't always require continuous, bulletproof automation. Transparent processes, occasional manual steps, and upskilling ops teams are often more pragmatic. To address common technical debt challenges, the speaker proposes simpler, modern solutions. For example, instead of elaborate feature stores or ML pipelines for training-serving skew, pre-processing logic can be embedded directly into the ML model (the "Transform Pattern" in frameworks like Keras). For multi-step workflows, individual components, including the ML model itself, can be deployed as microservices.
Data and concept drift, instead of requiring complex continuous evaluation and training systems, can often be managed effectively through scheduled retraining (e.g., monthly or quarterly), accepting minor, temporary drift as a trade-off for reduced complexity. Similarly, a monthly "from scratch build" during scheduled releases can replace continuous integration/deployment for every code change. Reproducibility can be ensured with robust version control for code, data, and models. In conclusion, the speaker asserts that the extensive MLOps solutions prevalent today are largely unnecessary for the vast majority (99%) of ML systems. He argues that modern data processing architectures (data warehouses, data lakes) and advanced ML frameworks (PyTorch, Keras) have significantly simplified many processes and intrinsically address much of the original technical debt. The push for complex, fully automated MLOps often leads to unwarranted complexity, with successful real-world ML systems tending to favor simpler, more direct approaches that prioritize impact over exhaustive automation. Complex MLOps implementations, according to the speaker, are only warranted in a very rare minority of specific, high-stakes scenarios.
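The "Transform Pattern" mentioned in this summary (pre-processing embedded in the model so that training and serving cannot diverge) can be sketched without any framework. The class below is a hypothetical, framework-free illustration, not Keras code and not the speaker's implementation: the scaling statistics learned at fit time are stored on the model and reapplied inside `predict`, so callers always pass raw inputs.

```python
from dataclasses import dataclass


@dataclass
class ScaledLinearModel:
    """Toy one-feature regressor that carries its own preprocessing."""

    mean: float    # feature mean learned at training time
    std: float     # feature standard deviation learned at training time
    weight: float  # slope on the standardized feature
    bias: float    # intercept (mean of the training targets)

    @classmethod
    def fit(cls, xs, ys):
        n = len(xs)
        mean = sum(xs) / n
        std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
        zs = [(x - mean) / std for x in xs]
        y_mean = sum(ys) / n
        # Least-squares slope on standardized inputs (zs have mean zero).
        weight = sum(z * (y - y_mean) for z, y in zip(zs, ys)) / sum(z * z for z in zs)
        return cls(mean=mean, std=std, weight=weight, bias=y_mean)

    def predict(self, raw_x):
        # The same transform used in training is applied here, at serving time.
        z = (raw_x - self.mean) / self.std
        return self.weight * z + self.bias


model = ScaledLinearModel.fit([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(model.predict(10))  # raw input, no external scaling step needed
```

The design point is that there is no separate scaler artifact to version or keep in sync: whatever serves the model also serves the transform.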

video youtube do

added 2026-01-09 turn into MLOps proposal for pdm

Roy Keyes Projects Portfolio

Use this portfolio as inspiration for my own portfolio.

Roy Keyes' "Data projects" page is a portfolio of personal projects demonstrating his skills in data science, machine learning, and visualization. A prominent project involves using machine learning …

blog code do publish

added 2026-01-09 portfolio

Structural Time Series

A review of the Cloudera Fast Forward Labs report on structural time series (STS) models, focusing on their application, interpretation, and ethical considerations. The report emphasizes the decomposition of time series into interpretable components and the use of GAMs for modeling.

The document "Structural Time Series" by Cloudera Fast Forward Labs provides a comprehensive overview of structural time series (STS) models, focusing on their interpretability and practical application. It explains that STS models represent an observed time series as a combination of explicit components such as trend, seasonality, and impact effects. The report outlines two primary approaches to STS:

* **State Space Models:** These models view the time series as being generated by unobserved (latent) dynamics, encompassing techniques like ARIMA and the Kalman filter. Open-source tools such as `bsts` in R and TensorFlow Probability's `sts` module support this formulation.
* **Generalized Additive Models (GAMs):** This approach decomposes the time series into smooth, additive functions, each representing a distinct component. GAMs are highlighted for their scalability, ease of interpretation, and ability to handle missing or irregularly spaced data. While they might be less accurate than some autoregressive methods, GAMs can be extended with Bayesian techniques to quantify forecast uncertainty.

Key components typically found in GAM-based structural time series models include:

* **Trend:** Describes the long-term upward or downward movement in the data. This can be modeled as global, local, piecewise linear, or even saturating (e.g., using a logistic function for processes with capacity limits).
* **Seasonality:** Refers to any repeating patterns at fixed intervals, such as daily, weekly, or yearly cycles. Fourier series are employed to flexibly model these periodic effects.
* **Impact Effects:** These are discrete, often sudden, changes in the time series caused by specific events like holidays. They are modeled as constant terms active only during the relevant periods.
* **External Regressors:** The model can incorporate additional external variables, such as outdoor temperature for electricity demand, to enhance predictive power. However, this may complicate model interpretability and necessitate forecasting these external factors as well.

The report also emphasizes critical considerations for evaluating time series models:

* **Forecast Horizons:** Model evaluation should align with the intended use of the forecast, whether for short-term or long-term predictions.
* **Appropriate Validation:** Standard cross-validation methods are unsuitable for time series data due to temporal dependencies. Instead, "forward chaining" or "rolling-origin" validation techniques are recommended.
* **Baselines:** Establishing a simple, well-understood baseline model (e.g., a seasonal naive forecast) is crucial for objectively measuring the performance improvements of more complex models.
* **Evaluation Metrics:** The Mean Absolute Percentage Error (MAPE) and Mean Absolute Scaled Error (MASE) are discussed, with MASE being presented as a more robust metric that inherently scales error relative to a naive baseline, making it suitable for comparing different models.

A practical demonstration involves forecasting electricity demand in California using Facebook's open-source Prophet library. Prophet implements a GAM with piecewise linear trends, multiple seasonal components using Fourier series, and holiday effects. The report illustrates an iterative model development process, including debugging techniques like analyzing forecasts against actuals, residual plots, and autocorrelation plots to identify areas for improvement. This led to refining the model to better capture increased variance during summer months and employing a multiplicative interaction by logging the demand data, resulting in improved MAPE (6.95%) and MASE (0.89) on the holdout set.
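The evaluation ideas in this summary (MAPE, MASE scaled against a naive baseline, and rolling-origin validation) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the report: the function names and the toy numbers are hypothetical, MASE is scaled by the in-sample error of a (seasonal) naive forecast, and MAPE assumes no actual value is zero.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, train, season=1):
    """Forecast MAE divided by the in-sample MAE of a seasonal naive forecast.

    With season=1 the baseline is the plain naive forecast (previous value).
    A result below 1 means the forecast beats that baseline on average.
    """
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    return mae / (sum(naive_errors) / len(naive_errors))


def rolling_origin_splits(n, initial, horizon, step=1):
    """Yield (train_indices, test_indices) pairs; training always precedes test."""
    origin = initial
    while origin + horizon <= n:
        yield range(0, origin), range(origin, origin + horizon)
        origin += step


train = [10.0, 12.0, 14.0, 16.0]
actual = [18.0, 20.0]
forecast = [17.0, 22.0]
print(f"MAPE: {mape(actual, forecast):.2f}%")
print(f"MASE: {mase(actual, forecast, train):.2f}")
for train_idx, test_idx in rolling_origin_splits(n=6, initial=3, horizon=2):
    print(list(train_idx), list(test_idx))
```

Unlike shuffled k-fold cross-validation, each split here only ever tests on observations that come after everything in its training window.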
The document underscores the value of probabilistic forecasts, enabled by Prophet's uncertainty bounds. By sampling possible future scenarios, users can answer more sophisticated, risk-related questions beyond simple point predictions, such as the probability of energy demand exceeding a certain threshold. Additionally, the report explores the use of these models for backcasting, imputing missing data, and detecting anomalies. Ethical considerations are also addressed, advocating for the use of inherently interpretable models like GAMs, especially for high-stakes decisions, as they offer transparency without necessarily compromising accuracy. The report concludes by discussing ongoing research in time series forecasting, including automated structural component discovery and the application of transformer models for multivariate time series.
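The risk-style question mentioned in this summary (the probability of demand exceeding a threshold) reduces to counting sampled scenarios. The sketch below assumes the forecast samples are already available as a list of simulated future paths, however they were produced; the function name and the toy paths are hypothetical and no Prophet API is used.

```python
def prob_any_exceeds(paths, threshold):
    """Share of sampled future paths in which any step exceeds the threshold.

    `paths` is a list of equally long sequences, one per sampled scenario.
    """
    hits = sum(1 for path in paths if max(path) > threshold)
    return hits / len(paths)


# Four hypothetical sampled demand paths over a two-step horizon.
sampled_paths = [
    [1.0, 2.0],
    [3.0, 1.0],
    [0.0, 0.0],
    [5.0, 5.0],
]
print(prob_any_exceeds(sampled_paths, threshold=2.5))
```

The estimate is only as good as the sampled scenarios, so in practice many more paths than shown here would be drawn.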

article timeseries code

added 2026-01-08 time series basis

NixtlaVerse, bridging the gap between statistics and deep learning for time series

This talk explores the divide between classical statistical and modern deep learning approaches in time series forecasting, presenting Nixtla's open-source efforts to bridge this gap with efficient and scalable solutions.

The speaker introduces time series forecasting as fundamental to the operational DNA of the world, with applications spanning finance, IoT, electricity, supply chains, and healthcare. The field is characterized by two distinct 'mountains' or families of methods: classical statistical forecasting and deep learning approaches. The classical tradition, championed by statisticians and econometricians (e.g., ARIMA, ETS, developed by figures like Rob Hyndman), focuses on interpretable, robust models. The deep learning side, driven by machine learning practitioners, leverages neural networks (e.g., LSTMs, Transformers) for their flexibility and scalability. These two families often view each other with skepticism, criticizing each other's shortcomings. Nixtla aims to bridge this divide with two core open-source libraries: `statsforecast` and `neuralforecast`. `statsforecast` provides extremely fast and accurate implementations of classical algorithms, designed for scalability and bridging the R to Python ecosystem gap. The speaker demonstrates its superior performance and cost-effectiveness compared to popular libraries like Prophet, emphasizing that newer is not always better. On the other hand, `neuralforecast` offers a scalable, user-friendly interface for deep learning models, highlighting advantages like improved accuracy for long horizons, simpler pipelines, and the potential for transfer learning in zero-shot scenarios. The core of Nixtla's 'bridge-building' effort lies in models like N-BEATS and N-HITS. N-BEATS was an early attempt to integrate interpretability (signal decomposition) into neural networks. However, for long-horizon forecasting, both classical and early deep learning methods struggled with accuracy and computational complexity.
To address this, Nixtla developed N-HITS, an architecture that combines multi-rate signal sampling and hierarchical interpolation. N-HITS significantly improves accuracy and speed for long-horizon problems, outperforming even transformer-based methods, and offers theoretical connections to Fourier transforms, enabling signal decomposition and interpretability within a neural network framework. The talk concludes by outlining Nixtla's 'Further Adventures,' including hierarchical forecasting methods for reconciling forecasts across different organizational levels and a low-latency forecasting API for developers. During the Q&A, the speaker clarifies that the libraries support cross-validation for evaluation, handle multi-seasonality (MSTL in `statsforecast`, N-HITS in `neuralforecast`), and are working on features for missing data and sparse data handling through distributional methods and wavelet transform capabilities.
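The hierarchical reconciliation mentioned at the end of this summary can be illustrated with the simplest such method, bottom-up aggregation. This is a plain-Python sketch, not Nixtla's API and not necessarily the method shown in the talk: the function name, node names, and numbers are hypothetical. Forecasts are made only at the lowest level, and every higher level is defined as the sum of its children, so all levels agree by construction.

```python
def bottom_up(bottom_forecasts, hierarchy):
    """Build coherent forecasts for every node from bottom-level forecasts.

    bottom_forecasts: dict mapping leaf name -> list of forecasts per step.
    hierarchy: dict mapping parent name -> list of child names
               (children may be leaves or other parents).
    """
    reconciled = dict(bottom_forecasts)

    def total(node):
        # Compute a parent as the step-by-step sum of its children.
        if node not in reconciled:
            child_series = [total(child) for child in hierarchy[node]]
            reconciled[node] = [sum(values) for values in zip(*child_series)]
        return reconciled[node]

    for parent in hierarchy:
        total(parent)
    return reconciled


bottom = {"store_a": [1.0, 2.0], "store_b": [3.0, 4.0], "store_c": [5.0, 6.0]}
tree = {"total": ["north", "store_c"], "north": ["store_a", "store_b"]}
for name, series in bottom_up(bottom, tree).items():
    print(name, series)
```

Bottom-up ignores any forecast made directly at the aggregate levels; other reconciliation methods combine information from all levels.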

timeseries video youtube code publish

added 2026-01-05 improve time series analysis

Bayesian Causal Inference and PyMC: A Conversation with Thomas Wiecki

Dr. Thomas Wiecki, creator of PyMC, discusses his journey into probabilistic programming, the crucial intersection of Bayesian modeling and causal inference, and how PyMC is integrating new tools like the `do` operator to solve real-world problems and enhance decision-making.

In this compelling discussion, Dr. Thomas Wiecki, the driving force behind PyMC, delves into his personal and professional journey, from childhood programming experiments to developing one of Python's most recognized probabilistic programming frameworks. He highlights the growing synergy between Bayesian modeling and causal inference, emphasizing that what Bayesians often call the 'data generative process' is fundamentally akin to structural causal modeling. A significant recent development in PyMC is the introduction of the `do` operator, enabling users to directly express and analyze interventions, a critical component for answering structural causal questions within a Bayesian framework. Wiecki argues that the ultimate purpose of data science is to facilitate better decision-making, not just prediction. He stresses that understanding "what causes what" is paramount for effective action, a concept that resonates deeply with non-technical stakeholders and helps convey the value of Bayesian methods more effectively than discussions about priors or uncertainty. He addresses the common fear of explicitly defining model structures, reassuring that "it's great to be wrong": an iterative process of building, testing (e.g., with posterior predictive checks), and refining models leads to profound learning and better alignment with domain expertise. This approach contrasts sharply with black-box predictive models, which often fail to provide actionable insights or explain underlying business problems. The conversation also explores the practical advantages of Bayesian modeling in causal contexts.
Unlike frequentist approaches that may struggle with variable selection in structural models, Bayesian frameworks allow direct estimation of the complete structural model, naturally accounting for phenomena like colliders. The generative nature of Bayesian models facilitates structural discovery and hypothesis testing by allowing users to simulate data from their proposed causal graphs. Furthermore, the inherent ability of Bayesian methods to quantify both aleatoric and epistemic uncertainty provides a richer understanding of risk, which is crucial for optimizing decisions—such as allocating marketing budgets—in a manner that reflects human risk aversion. Wiecki concludes by sharing insights on career development and innovation, advocating for following one's passion, taking calculated risks, and embracing continuous exploration. He underscores the immense value of the open-source community, where collaboration drives the boundaries of what's possible. The discussion reinforces the vision for PyMC and PyMC Labs to continue integrating advanced causal inference tools, making these powerful methodologies more accessible and impactful for solving complex, real-world problems across diverse domains.
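The point about colliders and about simulating data from a proposed causal graph can be made concrete with a small simulation. This is a plain-Python sketch rather than PyMC code, and the graph, noise levels, and selection threshold are assumptions chosen for illustration: `x` and `y` are independent causes of a collider `c`, and selecting on `c` induces a spurious association between them.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_collider(n=20000, seed=1):
    """Draw (x, y, c) from the graph x -> c <- y, with x and y independent."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)            # independent of x by construction
        c = x + y + rng.gauss(0.0, 0.5)    # collider: caused by both x and y
        data.append((x, y, c))
    return data


data = simulate_collider()
all_corr = correlation([x for x, _, _ in data], [y for _, y, _ in data])

# Conditioning on the collider: keep only rows where c is large.
selected = [(x, y) for x, y, c in data if c > 1.0]
selected_corr = correlation([x for x, _ in selected], [y for _, y in selected])

print(f"corr(x, y) in all data:       {all_corr:+.2f}")
print(f"corr(x, y) given c > 1.0:     {selected_corr:+.2f}")
```

In the full sample the correlation is close to zero, while in the selected subset it turns clearly negative, which is why a model that conditions on a collider can report an effect that the generating graph does not contain.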

video youtube code publish

added 2026-01-05 improve causal inference

Implied Volatility Surface for SPY Options

A Python application that visualizes the implied volatility surface for options, using real-time data and the Black-Scholes model. The application provides an interactive 3D surface and adjustable parameters for customization.

The "Implied-Volatility-Surface" repository by MateuszJastrzebski21 hosts a Python application designed to visualize the implied volatility surface for SPY options. The application leverages real-time…
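The core calculation behind an implied volatility surface can be sketched as follows. This is not the repository's code: it assumes a European call with no dividends priced by the Black-Scholes formula, and the function names and the bisection bounds are hypothetical choices. Implied volatility is the `sigma` at which the model price matches an observed option price; repeating the inversion over a grid of strikes and maturities yields the surface.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot, strike, rate, sigma, maturity):
    """Black-Scholes price of a European call (no dividends assumed)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (
        sigma * math.sqrt(maturity)
    )
    d2 = d1 - sigma * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def implied_vol(price, spot, strike, rate, maturity, lo=1e-6, hi=5.0, tol=1e-8):
    """Invert the call price for sigma by bisection.

    Works because the call price is increasing in sigma; assumes the
    observed price lies between the model prices at `lo` and `hi`.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call_price(spot, strike, rate, mid, maturity) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Round trip: price an option at a known volatility, then recover it.
price = bs_call_price(spot=100.0, strike=105.0, rate=0.03, sigma=0.25, maturity=0.5)
print(f"recovered sigma: {implied_vol(price, 100.0, 105.0, 0.03, 0.5):.4f}")
```

Bisection is slower than Newton's method with vega but needs no derivative and cannot overshoot, which keeps the sketch short and robust.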

github repo code publish

added 2026-01-05 learn more about volatility