This is the coding illustration of the first part of my job market paper.
I propose a conditional density estimator that is built on an orthonormal series representation, so that estimating a conditional density reduces to estimating conditional means with off-the-shelf machine learning methods.
Two questions that often come up: why do we need to estimate a conditional density in the first place, and what is the difference between this estimator and Izbicki and Lee (2017)?
Note: If you want to dive deep into the theory, refer to my job market paper for more details.
Suppose we want to learn the likelihood of some variable $Y$ given covariates $X$, i.e. the conditional density $f_{Y|X}$.
Assuming you have some statistics background, one way of estimating the conditional density relies on the following identity/definition:
$$ f_{Y|X}(y|x) = \frac{f_{Y,X}(y,x)}{f_{X}(x)},$$where $f_{Y,X}$ and $f_X$ denote the joint density of $(Y,X)$ and the marginal density of $X$, respectively. One popular approach uses kernel methods to estimate $f_{Y,X}$ and $f_X$ separately and then forms the ratio. However, for $X$ of even moderate dimension, the curse of dimensionality kicks in quickly.
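For concreteness, here is a minimal sketch of the ratio approach with product Gaussian kernels. This is the baseline being compared against, not the estimator proposed in the paper; the function kernel_cde, the fixed bandwidth h, and the toy data are purely illustrative assumptions.
import numpy as np
# kernel ratio estimator: f_hat(y|x0) = f_hat_{Y,X}(y, x0) / f_hat_X(x0)
def kernel_cde(y_grid, x0, Y, X, h=0.1):
    # Gaussian kernel weight of each X_i around the conditioning point x0
    wx = np.exp(-0.5 * np.sum(((X - x0) / h) ** 2, axis=1))
    # Gaussian kernel in the Y direction, one row per grid point
    ky = np.exp(-0.5 * ((y_grid[:, None] - Y[None, :]) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    # the X-kernel normalizing constants cancel between numerator and denominator
    return (ky * wx[None, :]).mean(axis=1) / wx.mean()
# toy usage: 500 observations, 5 covariates; as the dimension of X grows,
# the weights wx concentrate on very few observations (the curse of dimensionality)
rng = np.random.default_rng(0)
Xtoy = rng.uniform(size=(500, 5))
Ytoy = rng.beta(2, 5, size=500)
dens = kernel_cde(np.linspace(0, 1, 101), Xtoy[0], Ytoy, Xtoy)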
In contrast, my method relies on the following representation
$$ f_{Y|X}(y|x) = \sum_{j=1}^\infty E[\phi_j(Y)|X=x]\,\phi_j(y), $$where $\{\phi_j\}_{j=1}^\infty$ is a set of known functions chosen by the researcher that form an orthonormal basis.
You may wonder: this is a rather daunting expression that looks more complicated than before, so how does it help us?
The key insight is that this expression effectively converts the problem of estimating a conditional density into the problem of estimating many conditional means, each of which can be estimated with any state-of-the-art machine learning estimator.
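To see where the representation comes from (under square-integrability conditions that are spelled out in the paper), note that for each fixed $x$ we can expand $f_{Y|X}(\cdot|x)$ in the orthonormal basis $\{\phi_j\}$, and its Fourier coefficients are exactly conditional means:
$$ \int f_{Y|X}(y|x)\,\phi_j(y)\,dy = E[\phi_j(Y)\mid X=x]. $$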
A short comment on the basis:
The series representation of the conditional density relies on the basis functions $\{\phi_j\}_{j=1}^\infty$. In general, depending on the range of $Y$, there are many different choices of bases to use.
In the example below, we assume $Y$ is bounded and takes values in $[0,1]$. We consider a very simple orthonormal basis, the cosine basis: $$ \phi_1(y) = 1;\quad \phi_j(y) = \sqrt{2}\cos(\pi(j-1)y),\quad j=2,3,\cdots$$
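As a quick standalone check (separate from the estimator code further below), we can code up this basis and verify orthonormality numerically:
import numpy as np
from scipy import integrate
# cosine basis on [0,1]: phi_1(y) = 1, phi_j(y) = sqrt(2)*cos(pi*(j-1)*y) for j >= 2
def phi(j, y):
    return 1.0 if j == 1 else np.sqrt(2) * np.cos(np.pi * (j - 1) * y)
# the integral of phi_j * phi_k over [0,1] should be 1 when j == k and 0 otherwise
for j, k in [(1, 1), (2, 2), (2, 3)]:
    val, _ = integrate.quad(lambda y: phi(j, y) * phi(k, y), 0, 1)
    print(j, k, round(val, 6))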
My estimator takes the following form (we use the hat notation $\hat{\cdot}$ to denote estimators):
$$ \hat{f}_J(y|x) = \sum_{j=1}^{J}\hat{E}[\phi_j(Y)|X=x]\,\phi_j(y), $$and there are two main tasks:
for $j=1,\cdots, J$, we need to estimate the conditional means $E[\phi_j(Y)|X]$ (see the short sketch after these two points);
we also need a data-driven way of choosing the optimal series cutoff $J$.
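A minimal sketch of the first task, using toy data and a random forest chosen purely as an example regressor; any machine learning estimator of a conditional mean could be plugged in, and the full implementation appears further below. The second task, choosing $J$, is handled by the cross-validation described next.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# toy data: Y takes values in [0,1], a handful of covariates (shapes are illustrative only)
rng = np.random.default_rng(0)
Xtoy = rng.uniform(size=(500, 5))
ytoy = rng.beta(2, 5, size=500)
J = 6
# phi_1 = 1, so E[phi_1(Y)|X] = 1 and needs no model; fit one regressor for each j = 2,...,J
models = [RandomForestRegressor(random_state=0).fit(Xtoy, np.sqrt(2) * np.cos(np.pi * (j - 1) * ytoy))
          for j in range(2, J + 1)]
# plug the fitted conditional means into the truncated series at a point (y0, x0)
x0, y0 = Xtoy[:1], 0.3
fhat = 1.0 + sum(m.predict(x0)[0] * np.sqrt(2) * np.cos(np.pi * (j - 1) * y0)
                 for j, m in zip(range(2, J + 1), models))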
For illustration purposes, let's consider 2-fold CV; the procedure generalizes easily to $K$-fold CV.
step 1: split the data into training and validating samples;
step 2: for a large $p$, use the training sample to train $\hat{f}_J$ for all $J=1,\cdots, p$;
step 3: use the validating sample to compute the empirical risk;
step 4: swap the training and validating samples and repeat steps 2-3;
step 5: find the optimal cutoff $\hat{J}$ that minimizes the averaged empirical risk;
step 6: average the trained $\hat{f}_{\hat{J}}$ across the folds to obtain the final estimator $\bar{f}$.
A key contribution of my paper is showing that this estimator $\bar{f}$ is in fact optimal (asymptotically equivalent to $\hat{f}_{J^*}$ with the best possible cutoff $J^*$)!
Note: See below for a short discussion on the empirical risk we use in the cross-validation.
Moreover, unlike the usual prediction problem, where we can compare predicted values against actual values, here the object we are trying to estimate is the unknown conditional density. So how do we do cross-validation in this setting?
Suppose that we are trying to find $\hat{f}$ that minimizes the integrated squared error:
$$ E_X\left[\int \big(\hat{f}(y,X) - f_{Y|X}(y|X)\big)^2 dy\right]$$which is equivalent to minimizing
$$ R(\hat{f}) := E_X\left[\int \big(\hat{f}(y,X) - f_{Y|X}(y|X)\big)^2 dy\right] - E_X\left[\int f_{Y|X}^2(y|X)\,dy\right]$$This is not something that we can work with directly, since $f_{Y|X}$ is unknown.
In the paper, we show a very convenient fact: $R(\hat{f})$ can actually be written as an expression without the unknown $f_{Y|X}$:
$$ R(\hat{f}) = E\left[\int \hat{f}^2(y,X)\, dy - 2\hat{f}(Y,X)\right] $$(expand the square in $R(\hat{f})$ and apply iterated expectations to the cross term: $E_X[\int \hat{f}(y,X) f_{Y|X}(y|X)\,dy] = E[\hat{f}(Y,X)]$), and we can work with the sample analogue
$$ R_n(\hat{f}) = \frac{1}{n}\sum_{i=1}^n \left[\int \hat{f}^2(y,X_i)\, dy - 2\hat{f}(Y_i,X_i)\right]. $$In our case with $\hat{f}_J$, this expression simplifies even further to
$$ R_n(\hat{f}_J) = \frac{1}{n}\sum_{i=1}^n \left[\sum_{j=1}^J \big(\hat{E}[\phi_j(Y)|X_i]\big)^2 - 2\hat{f}_J(Y_i,X_i)\right], $$which avoids integration altogether (a consequence of the orthonormal basis).
This is the empirical risk that we are going to use in our cross-validation procedure.
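For completeness, here is why the integral disappears: by orthonormality, $\int_0^1 \phi_j(y)\phi_k(y)\,dy = 1\{j=k\}$, so
$$ \int \hat{f}_J^2(y,X_i)\,dy = \int \Big(\sum_{j=1}^J \hat{E}[\phi_j(Y)|X_i]\,\phi_j(y)\Big)^2 dy = \sum_{j=1}^J \big(\hat{E}[\phi_j(Y)|X_i]\big)^2. $$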
import pandas as pd
import numpy as np
from copy import deepcopy
from sklearn.linear_model import Lasso, LassoCV, LogisticRegression, LogisticRegressionCV
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import integrate
# part 1: define the score function (average empirical risk in cross-validation)
# x is the data (Y in the first column, covariates X in the remaining columns)
# xs is the KFold split index
# J is the series cutoff
# h is a list of lists of the estimated conditional means,
# i.e. h[i] is the list of estimated conditional means associated with the i-th split
# k is the number of folds
# this score function will go inside the estimator

# the naive estimator fJ used inside the loss function:
# fJ(x, y, J, h) computes sum_{j=1}^{J} Ehat[phi_j(Y)|x] * phi_j(y);
# the leading 1 is the phi_1 term, since phi_1 = 1 and E[phi_1(Y)|X] = 1
def fJ(x, y, J, h):
    sum1 = 1
    for j in range(1, J):
        sum1 = sum1 + (h[j].predict(x)) * (2**0.5) * np.cos(np.pi * j * y)
    return sum1

# loss function used to construct the score:
# sum_{j=1}^{J} (Ehat[phi_j(Y)|x])^2 - 2*fJ, i.e. the summand of the empirical risk
def loss(x, y, J, h):
    summ = 0
    for j in range(J):
        summ = summ + (h[j].predict(x))**2
    return summ - 2 * fJ(x, y, J, h)

# the score function: average the empirical risk over the k validation folds
def score(x, xs, J, h, k):
    score_ = []
    dx = x.shape[1]
    for i in range(k):
        # validation-fold outcomes and covariates (Y is the first column)
        y = x.iloc[xs[i][1]].iloc[:, 0]
        X = x.iloc[xs[i][1]].iloc[:, range(1, dx)]
        score_.append(np.mean(loss(X, y, J, h[i])))
    return np.mean(score_)
# part 2: construct our estimator
# (the step comments correspond to the CV algorithm introduced before, generalized to k folds)
# x = data (Y in the first column), p = largest series cutoff considered, k = number of folds
def est(x, p, k):
    # step 1: split the sample for cross-validation
    kf = KFold(n_splits=k, shuffle=True)
    # xs is the KFold split index
    xs = [(train, test) for train, test in kf.split(x)]
    # h will be a list of lists of deepcopied trained models
    # s will be a list of scores / associated average empirical risks
    h = [[] for _ in range(k)]
    s = []
    # total number of columns in the data; Y should always be in the first column
    dx = x.shape[1]
    # step 2: for each training sample, create a list of trained models
    # x.iloc[xs[i][0]] corresponds to the i-th training sample
    # x.iloc[xs[i][1]] corresponds to the i-th validation sample
    for i in range(k):
        # i-th training sample Y values
        y = x.iloc[xs[i][0]].iloc[:, 0]
        # i-th training sample X values
        X = x.iloc[xs[i][0]].iloc[:, range(1, dx)]
        for j in range(p):
            # this is phi_{j+1}(y): equal to 1 when j = 0, and sqrt(2)*cos(pi*j*y) when j >= 1
            pjy = np.cos(np.pi * j * y) if j < 1 else (2**0.5) * np.cos(np.pi * j * y)
            # MLP neural net: this estimates E[phi_{j+1}(Y)|X] using an MLP
            # model = MLPRegressor(random_state=923, max_iter=200).fit(X, pjy)
            # alternatively, consider a random forest
            model = RandomForestRegressor(max_depth=50, random_state=923).fit(X, pjy)
            # deepcopy the trained model; it will be used for cross-validation
            h[i].append(deepcopy(model))
    # steps 3/4: this creates a list of scores / average empirical risks from cross-validation,
    # each associated with a specific series cutoff
    for J in range(1, p + 1):
        s.append(score(x, xs, J, h, k))
    # steps 5/6:
    # find the cutoff associated with the smallest average empirical risk;
    # construct the final estimator by averaging the estimators trained on each training sample
    def amcv0(x, y, s, h, k):
        jhat = np.argmin(s) + 1
        f = 0
        for i in range(k):
            f = f + fJ(x, y, jhat, h[i])
        return f / k
    # the end
    return lambda x, y: amcv0(x, y, s, h, k)
The code is simple and concise, and most importantly, it works! But there are some comments/future improvements that could readily be implemented.
Many tasks in step 2 are done sequentially (for loops), but they could easily run in parallel, which would be a huge improvement in speed; a short sketch follows after these comments.
The machine learners in step 2 should be tuned further. Additional cross-validation can be added for hyper-parameter tuning, which is another reason to parallelize step 2.
To make the code more user-friendly, we could add choices for the basis and for the machine learners. This can be achieved by rewriting the code in an object-oriented manner.
Additional post-processing might be needed to make sure that the estimated conditional density is positive, especially near the boundary. We have a p-algorithm that does this, which actually further reduces the estimation error. See the code in the approximate sparsity project for details.
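A minimal sketch of the parallelization point, assuming joblib is available (the helper fit_one below is hypothetical; it just wraps the fit of a single conditional mean from step 2):
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor
# hypothetical helper: fit the conditional mean of phi_{j+1}(Y) on one training fold
def fit_one(X, y, j):
    pjy = np.cos(np.pi * j * y) if j < 1 else (2**0.5) * np.cos(np.pi * j * y)
    return RandomForestRegressor(max_depth=50, random_state=923).fit(X, pjy)
# inside est(), the inner loop of step 2 could then become, for each fold i:
# h[i] = Parallel(n_jobs=-1)(delayed(fit_one)(X, y, j) for j in range(p))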
## can simulate some data to try out the estimator to make sure it runs
# beta(2,5) distribution, 10000 observations, 100 covariates
# Xsim = np.random.beta(2,5, size = (10000,100))
# simulate the coefficients, assuming approximate sparsity
#B0 = []
#for i in range(100):
# B0.append(((i+1)**(-2)))
#B0 = np.asarray(B0)
# generate the outcome variable Y = exp(X'B + u)/(1+ exp(X'B + u)), u is standard normal noise
#sigmoid = lambda x: 1 / (1 + np.exp(-x))
#Ysim = sigmoid(Xsim@B0 + np.random.normal(0,1,10000))
#Ysim = pd.DataFrame(np.reshape(Ysim, (-1,1)))
#Ysim.columns=['y']
#Xsim = pd.DataFrame(Xsim)
#Xsim.columns =["X"+str(i) for i in range(100)]
#XYsim = pd.concat([Ysim, Xsim], axis = 1)
# ftry = est(XYsim,50,5) # the code runs just fine
## A sanity check to make sure that we actually have a density:
# ftry(XYsim.iloc[99,1:].values.reshape(1,-1),0.1)
# t = np.linspace(0,1,10000)
# y = ftry(XYsim.iloc[99,1:].values.reshape(1,-1),t) # for a single observation, we have to use reshape(1,-1) to get the right shape
# plt.plot(t,y)
# plt.show()
# trydens = lambda t: ftry(XYsim.iloc[99,1:].values.reshape(1,-1),t)
# integrate.quad(trydens, 0, 1)