A summary of Colin Carroll's website highlighting his work in software engineering, machine learning, and Bayesian statistics. The site includes a blog, talks, and research projects.
Colin Carroll's website presents a portfolio of his work in software engineering and mathematics, with a strong focus on machine learning and Bayesian statistics. The content is structured into a blog, a selection of talks, and a showcase of his research and open-source projects.
The blog features articles on various topics, including Markov Chain Monte Carlo (MCMC) methods, Hamiltonian Monte Carlo, and probabilistic programming languages. Several posts are highlighted as particularly interesting, suggesting key insights or innovative approaches. His selected talks cover a range of topics, such as scalable Bayesian workflows implemented in JAX, the benefits and strategies for adopting static typing, and methods for effectively visualizing Bayesian models.
Carroll's research and open-source contributions are significant, with notable involvement in projects like PyMC and ArviZ. He has also developed a personal project called Bayeux, focused on state-of-the-art inference methods. Furthermore, he has contributed to TensorFlow Probability and has co-authored research papers on topics such as MCMC, Bayesian neural fields, and probabilistic time series forecasting.
Beyond his primary research and open-source contributions, the site also highlights several side projects, including tools for data visualization and computational art. These projects showcase his diverse skillset and interests, demonstrating his ability to apply his technical expertise to creative and visually engaging applications.
Dr. Juan Camilo Orduz's article explores causal inference using Probabilistic Programming Languages (PPLs) like PyMC, offering a framework to address the limitations of traditional statistical methods by enabling counterfactual reasoning. The article demonstrates these concepts using the Lalonde dataset and compares OLS and GLM models for estimating the Average Treatment Effect (ATE).
The article "Introduction to Causal Inference with PPLs" by Dr. Juan Camilo Orduz explains how Probabilistic Programming Languages (PPLs) like PyMC offer a powerful framework for causal inference, addressing the limitations of traditional statistical methods that often confuse correlation with causation. Causal inference seeks to answer the question, "What would have happened if things were different?"
PPLs provide several advantages for causal inference: they allow for the natural expression of causal models, rigorous quantification of uncertainty through full posterior distributions, and direct implementation of Pearl's `do` operator for counterfactual questions. They also offer flexible modeling without sacrificing interpretability.
The article uses the Lalonde dataset, which investigates the effect of a job training program on earnings, to demonstrate building causal models with PPLs. It guides users through building models that account for confounders, estimating the Average Treatment Effect (ATE) using Bayesian inference, comparing naive versus adjusted estimates, and validating models.
The approach taken in the notebook reproduces and extends ChiRho's backdoor adjustment tutorial, implementing the causal modeling strategy using PyMC. The article details steps for data preprocessing, establishing a Causal Directed Acyclic Graph (DAG), exploratory data analysis, and scaling numerical features.
The article then delves into specifying both a linear model (OLS) and a Generalized Linear Model (GLM) in PyMC for earnings prediction, including prior predictive checks, model fitting, and diagnostics. It explains ATE estimation using both the coefficient from the OLS model and the more general `do` operator, demonstrating that both methods yield consistent results.
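The idea behind the `do` operator can be illustrated without any modeling library. The sketch below is not the article's PyMC code: it simulates a toy structural causal model in plain Python, where the confounder `z`, the coefficients, and the helper names (`simulate`, `naive_difference`, `interventional_ate`) are all hypothetical. Intervening means replacing the treatment's own mechanism with a fixed value, which cuts the arrow from the confounder into the treatment.

```python
import random

TRUE_EFFECT = 2.0  # hypothetical ground-truth effect of treatment on outcome


def simulate(n, seed, do_treatment=None):
    """Draw n rows (z, t, y) from a toy model: z -> t, z -> y, t -> y.

    Passing do_treatment fixes t for every unit, ignoring z (an intervention).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # confounder
        if do_treatment is None:
            # Observational regime: units with high z are treated more often.
            t = 1 if rng.random() < (0.8 if z > 0 else 0.2) else 0
        else:
            t = do_treatment
        y = TRUE_EFFECT * t + 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated units."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def interventional_ate(n, seed):
    """ATE as E[y | do(t=1)] - E[y | do(t=0)], using two simulated interventions."""
    y1 = [y for _, _, y in simulate(n, seed, do_treatment=1)]
    y0 = [y for _, _, y in simulate(n, seed + 1, do_treatment=0)]
    return mean(y1) - mean(y0)
```

Comparing `naive_difference(simulate(20000, seed=1))` with `interventional_ate(20000, seed=2)` shows the naive contrast absorbing the confounder's influence, while the interventional contrast stays close to `TRUE_EFFECT`. With real data the interventions cannot be simulated from known mechanisms, which is why the article fits a model first and then applies the intervention to it.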
Finally, the notebook compares the OLS and GLM models, highlighting that the GLM, which ensures non-negative earnings through a Gamma likelihood and softplus link function, provides a better fit and tighter credible intervals for the ATE estimate. The conclusion emphasizes that while traditional OLS methods suffice for basic problems, PPLs offer superior flexibility for complex models, incorporation of prior knowledge, and model calibration, especially in scenarios with non-linear relationships or unobserved confounders.
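To see why a softplus transformation combined with a Gamma likelihood keeps predicted earnings non-negative, consider the following minimal sketch. It is not taken from the notebook: the function names are hypothetical, and the mean and standard deviation parameterization of the Gamma distribution is an assumption made for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); never negative for any real x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gamma_shape_rate(mu, sigma):
    """Convert a mean and standard deviation into Gamma shape and rate.

    Uses mean = shape / rate and variance = shape / rate**2.
    Assumes mu > 0 and sigma > 0.
    """
    shape = mu ** 2 / sigma ** 2
    rate = mu / sigma ** 2
    return shape, rate


def predicted_mean(linear_predictor):
    """Map an unconstrained linear predictor to a non-negative mean."""
    return softplus(linear_predictor)
```

A linear predictor of any sign maps to a valid mean, and the Gamma distribution built from that mean only places mass on positive values. A Gaussian likelihood, as in the OLS model, has no such constraint and can assign probability to negative earnings.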
Learn to build hybrid data workflows using DuckDB and MotherDuck, enabling local execution and cloud scalability without changing tools. The course covers setting up DuckDB, querying data, transforming data with Python, and optimizing costs.
The course "DuckDB for Data Engineers: From Local to Cloud with MotherDuck" provides practical instruction on leveraging DuckDB and MotherDuck for building hybrid data workflows. DuckDB, a lightweight OLAP engine, allows for local or embedded execution, while MotherDuck extends this functionality to a managed cloud service with shared databases and elastic compute.
The curriculum covers setting up DuckDB locally to query CSV and Parquet files, creating persistent databases, and visualizing data. Students will learn how to transform data with Python for ELT processes, including cleaning, normalizing, and deriving new metrics. A key focus is comparing local versus cloud query execution, optimizing costs, and combining local and cloud data in a single workflow.
Furthermore, the course covers DuckLake, a lakehouse format that provides metadata, schema evolution, and transactional guarantees for Parquet files. Participants learn how to connect DuckLake to S3 buckets and query the data directly with SQL.
This tutorial offers a quickstart guide to BigQuery ML, demonstrating how to build predictive models using SQL and historical horse racing data.
This Fireship.io tutorial demonstrates how to use BigQuery ML to build predictive models without extensive data science expertise. The core example focuses on predicting horse racing outcomes using historical data.
The tutorial begins by importing data from Kaggle into BigQuery. This involves uploading the data to Google Cloud Storage and subsequently creating a dataset and table within BigQuery to house the information. Initial SQL queries are performed, with an optional step to visualize and analyze the data further using Data Studio.
The heart of the tutorial involves building a predictive ML model using Datalab, a Python notebook environment connected to BigQuery. The tutorial walks through creating, evaluating, and then using the trained model directly within BigQuery ML. The result is a practical demonstration of how to use BigQuery ML for predictive analytics with SQL.
The speaker challenges the common perception that MLOps is universally essential for machine learning deployments. He explains MLOps as an extension of DevOps, aiming to enable operations teams to administer ML models, a need highlighted by the "hidden technical debt" in ML systems (Sculley et al., 2015). This technical debt arises because while ML model development is rapid, long-term maintenance is complex, especially when dealing with data drift, concept drift, and ensuring consistency between training and serving environments.
However, the speaker contends that the industry's current emphasis on formalizing and automating *every* step of the ML workflow (feature engineering, data validation, deployment, etc.) leads to "automation for automation's sake." This over-automation, as depicted in cloud provider architectures, creates rigid, overly complex systems that demand significant effort to maintain, diverting resources from core ML problem-solving. The speaker argues for a "keep it simple" philosophy: the primary goal is for ops personnel to maintain models, which doesn't always require continuous, bulletproof automation. Transparent processes, occasional manual steps, and upskilling ops teams are often more pragmatic.
To address common technical debt challenges, the speaker proposes simpler, modern solutions. For example, instead of elaborate feature stores or ML pipelines for training-serving skew, pre-processing logic can be embedded directly into the ML model (the "Transform Pattern" in frameworks like Keras). For multi-step workflows, individual components, including the ML model itself, can be deployed as microservices. Data and concept drift, instead of requiring complex continuous evaluation and training systems, can often be managed effectively through scheduled retraining (e.g., monthly or quarterly), accepting minor, temporary drift as a trade-off for reduced complexity. Similarly, a monthly "from scratch build" during scheduled releases can replace continuous integration/deployment for every code change. Reproducibility can be ensured with robust version control for code, data, and models.
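The idea of embedding pre-processing in the model can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not Keras code and not the speaker's implementation: the class `ScaledLinearModel` and its fields are hypothetical. The point is that the scaling statistics learned at training time travel with the model, so serving code passes raw inputs and cannot apply a different transformation by accident.

```python
from dataclasses import dataclass


@dataclass
class ScaledLinearModel:
    """Toy one-feature regressor that carries its own preprocessing."""

    mean: float = 0.0
    std: float = 1.0
    slope: float = 0.0
    intercept: float = 0.0

    def _transform(self, x):
        # Single source of truth for scaling, used in both fit and predict.
        return (x - self.mean) / self.std

    def fit(self, xs, ys):
        n = len(xs)
        self.mean = sum(xs) / n
        variance = sum((x - self.mean) ** 2 for x in xs) / n
        self.std = variance ** 0.5 or 1.0  # fall back to 1.0 for constant input
        zs = [self._transform(x) for x in xs]
        y_bar = sum(ys) / n
        # The scaled inputs have zero mean, so least squares simplifies.
        denom = sum(z * z for z in zs)
        num = sum(z * (y - y_bar) for z, y in zip(zs, ys))
        self.slope = num / denom if denom else 0.0
        self.intercept = y_bar
        return self

    def predict(self, raw_x):
        # Callers pass raw values; the stored statistics are applied here.
        return self.intercept + self.slope * self._transform(raw_x)
```

For example, `ScaledLinearModel().fit([0, 10, 20, 30], [1, 21, 41, 61]).predict(40)` takes the raw value 40 directly. Serializing the whole object ships the transform together with the weights, which is the property the pattern relies on.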
In conclusion, the speaker asserts that the extensive MLOps solutions prevalent today are largely unnecessary for the vast majority (99%) of ML systems. He argues that modern data processing architectures (data warehouses, data lakes) and advanced ML frameworks (PyTorch, Keras) have significantly simplified many processes and intrinsically address much of the original technical debt. The push for complex, fully automated MLOps often leads to unwarranted complexity, with successful real-world ML systems tending to favor simpler, more direct approaches that prioritize impact over exhaustive automation. Complex MLOps implementations, according to the speaker, are only warranted in a very rare minority of specific, high-stakes scenarios.
Use this portfolio as inspiration for my own portfolio.
Roy Keyes' "Data projects" page is a portfolio of personal projects demonstrating his skills in data science, machine learning, and visualization. A prominent project involves using machine learning to estimate radiation doses for cancer therapy, which was presented at SciPy 2017. He also gave introductory talks on neural networks and deep learning for the Houston Data Science Meetup group, indicating his teaching and communication abilities.
Keyes has also developed practical tools and libraries. "Slots," a Python library, allows users to explore multi-armed bandit strategies, while a Monte Carlo simulation tackles the dice game Klackers. These projects showcase his programming and simulation skills.
His visualization expertise is evident in projects like the D3.js visualization of Chutes and Ladders using Markov chains and the "ABQ Bikeability" map, which quantifies bikeability in Albuquerque. An open-source talk on "Data Science, Big Data, and other buzzwords" suggests his ability to demystify complex topics.
Further projects include analyses of UNM graduate student salaries and medical physics articles on arXiv.org, demonstrating his analytical skills and research interests. The development of "sparkmeters," small inline graphics for information design, highlights his attention to detail and commitment to effective communication.
A review of the Cloudera Fast Forward Labs report on structural time series (STS) models, focusing on their application, interpretation, and ethical considerations. The report emphasizes the decomposition of time series into interpretable components and the use of GAMs for modeling.
The document "Structural Time Series" by Cloudera Fast Forward Labs provides a comprehensive overview of structural time series (STS) models, focusing on their interpretability and practical application. It explains that STS models represent an observed time series as a combination of explicit components such as trend, seasonality, and impact effects.
The report outlines two primary approaches to STS:
* **State Space Models:** These models view the time series as being generated by unobserved (latent) dynamics, encompassing techniques like ARIMA and the Kalman filter. Open-source tools such as `bsts` in R and TensorFlow Probability's `sts` module support this formulation.
* **Generalized Additive Models (GAMs):** This approach decomposes the time series into smooth, additive functions, each representing a distinct component. GAMs are highlighted for their scalability, ease of interpretation, and ability to handle missing or irregularly spaced data. While they might be less accurate than some autoregressive methods, GAMs can be extended with Bayesian techniques to quantify forecast uncertainty.
Key components typically found in GAM-based structural time series models include:
* **Trend:** Describes the long-term upward or downward movement in the data. This can be modeled as global, local, piecewise linear, or even saturating (e.g., using a logistic function for processes with capacity limits).
* **Seasonality:** Refers to any repeating patterns at fixed intervals, such as daily, weekly, or yearly cycles. Fourier series are employed to flexibly model these periodic effects.
* **Impact Effects:** These are discrete, often sudden, changes in the time series caused by specific events like holidays. They are modeled as constant terms active only during the relevant periods.
* **External Regressors:** The model can incorporate additional external variables, such as outdoor temperature for electricity demand, to enhance predictive power. However, this may complicate model interpretability and necessitate forecasting these external factors as well.
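The Fourier-series treatment of seasonality listed above can be made concrete with a short sketch. The function names and the coefficient layout are assumptions for illustration, not the report's or Prophet's code. Each harmonic `k` contributes one sine and one cosine feature, and the seasonal component is a weighted sum of those features.

```python
import math


def fourier_features(t, period, order):
    """Return [sin(2*pi*k*t/period), cos(2*pi*k*t/period)] for k = 1..order."""
    features = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * t / period
        features.extend([math.sin(angle), math.cos(angle)])
    return features


def seasonal_component(t, period, coeffs):
    """Weighted sum of Fourier features; coeffs holds 2 * order weights."""
    order = len(coeffs) // 2
    features = fourier_features(t, period, order)
    return sum(c * f for c, f in zip(coeffs, features))
```

A higher `order` lets the component follow sharper within-period shapes at the cost of more parameters to fit. Because every feature repeats after `period` time units, the fitted component is periodic by construction.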
The report also emphasizes critical considerations for evaluating time series models:
* **Forecast Horizons:** Model evaluation should align with the intended use of the forecast, whether for short-term or long-term predictions.
* **Appropriate Validation:** Standard cross-validation methods are unsuitable for time series data due to temporal dependencies. Instead, "forward chaining" or "rolling-origin" validation techniques are recommended.
* **Baselines:** Establishing a simple, well-understood baseline model (e.g., a seasonal naive forecast) is crucial for objectively measuring the performance improvements of more complex models.
* **Evaluation Metrics:** The Mean Absolute Percentage Error (MAPE) and Mean Absolute Scaled Error (MASE) are discussed, with MASE being presented as a more robust metric that inherently scales error relative to a naive baseline, making it suitable for comparing different models.
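The validation scheme and the two metrics from the list above can be sketched as follows. This is a minimal illustration with hypothetical function names, not code from the report. The split generator always places the test window after the training window, and MASE divides the forecast error by the in-sample error of a seasonal naive forecast.

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=1):
    """Yield (train_indices, test_indices); training data always precedes test data."""
    origin = initial_train
    while origin + horizon <= n_obs:
        yield list(range(origin)), list(range(origin, origin + horizon))
        origin += step


def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(terms) / len(terms)


def mase(actual, forecast, train, season=1):
    """Forecast MAE scaled by the in-sample MAE of a seasonal naive forecast.

    Assumes the training series is not constant at lag `season`,
    otherwise the scale is zero.
    """
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale
```

A MASE below 1 means the model beats the naive baseline on average, which is why it doubles as a built-in baseline comparison. In this sketch the training window expands with each split; a fixed-length sliding window is an equally valid variant.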
A practical demonstration involves forecasting electricity demand in California using Facebook's open-source Prophet library. Prophet implements a GAM with piecewise linear trends, multiple seasonal components using Fourier series, and holiday effects. The report illustrates an iterative model development process, including debugging techniques like analyzing forecasts against actuals, residual plots, and autocorrelation plots to identify areas for improvement. This led to refining the model to better capture increased variance during summer months and employing a multiplicative interaction by logging the demand data, resulting in improved MAPE (6.95%) and MASE (0.89) on the holdout set.
The document underscores the value of probabilistic forecasts, enabled by Prophet's uncertainty bounds. By sampling possible future scenarios, users can answer more sophisticated, risk-related questions beyond simple point predictions, such as the probability of energy demand exceeding a certain threshold. Additionally, the report explores the use of these models for backcasting, imputing missing data, and detecting anomalies.
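Answering a risk question from sampled scenarios reduces to counting. The sketch below assumes the forecast samples are already available as plain lists of numbers, however the forecasting library exposes them; the function names are hypothetical and nothing here is Prophet's API.

```python
def exceedance_probability(samples, threshold):
    """Fraction of sampled values at one time step that exceed the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)


def prob_any_exceeds(paths, threshold):
    """Fraction of sampled future paths whose peak exceeds the threshold.

    Each path is one simulated scenario over the whole forecast horizon.
    """
    return sum(1 for path in paths if max(path) > threshold) / len(paths)
```

The first function answers "how likely is demand above the threshold at this hour", the second "how likely is demand above the threshold at any point in the horizon". The second is usually the larger number, and it cannot be read off from per-step uncertainty bounds alone.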
Ethical considerations are also addressed, advocating for the use of inherently interpretable models like GAMs, especially for high-stakes decisions, as they offer transparency without necessarily compromising accuracy. The report concludes by discussing ongoing research in time series forecasting, including automated structural component discovery and the application of transformer models for multivariate time series.
This talk explores the divide between classical statistical and modern deep learning approaches in time series forecasting, presenting Nixtla's open-source efforts to bridge this gap with efficient and scalable solutions.
The speaker introduces time series forecasting as fundamental to the operational DNA of the world, with applications spanning finance, IoT, electricity, supply chains, and healthcare. The field is characterized by two distinct 'mountains' or families of methods: classical statistical forecasting and deep learning approaches. The classical tradition, championed by statisticians and econometricians (e.g., ARIMA, ETS, developed by figures like Rob Hyndman), focuses on interpretable, robust models. The deep learning side, driven by machine learning practitioners, leverages neural networks (e.g., LSTMs, Transformers) for their flexibility and scalability. These two families often view each other with skepticism, criticizing each other's shortcomings.
Nixtla aims to bridge this divide with two core open-source libraries: `statsforecast` and `neuralforecast`. `statsforecast` provides extremely fast and accurate implementations of classical algorithms, designed for scalability and for closing the gap between the R and Python ecosystems. The speaker demonstrates its superior performance and cost-effectiveness compared to popular libraries like Prophet, emphasizing that newer is not always better. On the other hand, `neuralforecast` offers a scalable, user-friendly interface for deep learning models, highlighting advantages like improved accuracy for long horizons, simpler pipelines, and the potential for transfer learning in zero-shot scenarios.
The core of Nixtla's 'bridge-building' effort lies in models like N-BEATS and N-HITS. N-BEATS was an early attempt to integrate interpretability (signal decomposition) into neural networks. However, for long-horizon forecasting, both classical and early deep learning methods struggled with accuracy and computational complexity. To address this, Nixtla developed N-HITS, an architecture that combines multi-rate signal sampling and hierarchical interpolation. N-HITS significantly improves accuracy and speed for long-horizon problems, outperforming even transformer-based methods, and offers theoretical connections to Fourier transforms, enabling signal decomposition and interpretability within a neural network framework.
The talk concludes by outlining Nixtla's 'Further Adventures,' including hierarchical forecasting methods for reconciling forecasts across different organizational levels and a low-latency forecasting API for developers. During the Q&A, the speaker clarifies that the libraries support cross-validation for evaluation and handle multi-seasonality (MSTL in `statsforecast`, N-HITS in `neuralforecast`), and that the team is working on features for missing and sparse data through distributional methods and wavelet transform capabilities.
Dr. Thomas Wiecki, a core developer of PyMC, discusses his journey into probabilistic programming, the crucial intersection of Bayesian modeling and causal inference, and how PyMC is integrating new tools like the `do` operator to solve real-world problems and enhance decision-making.
In this compelling discussion, Dr. Thomas Wiecki, the driving force behind PyMC, delves into his personal and professional journey, from childhood programming experiments to developing one of Python's most recognized probabilistic programming frameworks. He highlights the growing synergy between Bayesian modeling and causal inference, emphasizing that what Bayesians often call the 'data generative process' is fundamentally akin to structural causal modeling. A significant recent development in PyMC is the introduction of the `do` operator, enabling users to directly express and analyze interventions, a critical component for answering structural causal questions within a Bayesian framework.
Wiecki argues that the ultimate purpose of data science is to facilitate better decision-making, not just prediction. He stresses that understanding "what causes what" is paramount for effective action, a concept that resonates deeply with non-technical stakeholders and helps convey the value of Bayesian methods more effectively than discussions about priors or uncertainty. He addresses the common fear of explicitly defining model structures, reassuring listeners that "it's great to be wrong": an iterative process of building, testing (e.g., with posterior predictive checks), and refining models leads to profound learning and better alignment with domain expertise. This approach contrasts sharply with black-box predictive models, which often fail to provide actionable insights or explain underlying business problems.
The conversation also explores the practical advantages of Bayesian modeling in causal contexts. Unlike frequentist approaches that may struggle with variable selection in structural models, Bayesian frameworks allow direct estimation of the complete structural model, naturally accounting for phenomena like colliders. The generative nature of Bayesian models facilitates structural discovery and hypothesis testing by allowing users to simulate data from their proposed causal graphs. Furthermore, the inherent ability of Bayesian methods to quantify both aleatoric and epistemic uncertainty provides a richer understanding of risk, which is crucial for optimizing decisions, such as allocating marketing budgets, in a manner that reflects human risk aversion.
Wiecki concludes by sharing insights on career development and innovation, advocating for following one's passion, taking calculated risks, and embracing continuous exploration. He underscores the immense value of the open-source community, where collaboration drives the boundaries of what's possible. The discussion reinforces the vision for PyMC and PyMC Labs to continue integrating advanced causal inference tools, making these powerful methodologies more accessible and impactful for solving complex, real-world problems across diverse domains.
A Python application that visualizes the implied volatility surface for options, using real-time data and the Black-Scholes model. The application provides an interactive 3D surface and adjustable parameters for customization.
The "Implied-Volatility-Surface" repository by MateuszJastrzebski21 hosts a Python application designed to visualize the implied volatility surface for SPY options. The application leverages real-time SPY option data fetched from Yahoo Finance to calculate implied volatility based on the Black-Scholes model. It then presents this data as an interactive 3D surface, illustrating how implied volatility varies with both time to expiration and strike price. A live demonstration of the application is available on Streamlit.
Key features include the interactive 3D volatility surface, real-time data fetching, adjustable risk-free rate, ticker symbol input, and customizable strike price filters. The application is built with Streamlit, providing a user-friendly and responsive interface. The project utilizes several Python libraries, including Streamlit for the frontend, yfinance for financial data, NumPy and Pandas for data manipulation, SciPy for implied volatility calculations, and Plotly for 3D visualization.
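The implied volatility at one point of the surface is the volatility that makes the Black-Scholes price match the observed option price. The sketch below is an assumption-laden illustration rather than the repository's code: it uses only the standard library and plain bisection as the root finder, whereas the project is described as using SciPy, and all function names are hypothetical. It handles European calls without dividends.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot, strike, t, rate, sigma):
    """Black-Scholes price of a European call (no dividends, t in years)."""
    sqrt_t = math.sqrt(t)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * t) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


def implied_vol(price, spot, strike, t, rate, lo=1e-6, hi=5.0, tol=1e-8):
    """Solve bs_call_price(sigma) = price by bisection.

    Returns None when the price cannot be matched for sigma in [lo, hi],
    for example when it lies below the option's intrinsic value.
    """
    low_price = bs_call_price(spot, strike, t, rate, lo)
    high_price = bs_call_price(spot, strike, t, rate, hi)
    if not (low_price <= price <= high_price):
        return None
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # The call price increases with sigma, so the root is bracketed.
        if bs_call_price(spot, strike, t, rate, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Evaluating `implied_vol` over a grid of strikes and expirations yields the values that a 3D surface plot would display. Quotes that return `None` would typically be filtered out before plotting.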
The project is released under the MIT License, enabling free use, modification, and distribution. Mateusz Jastrzębski is the creator of this valuable tool for options traders and financial analysts.