I've published in a variety of fields. My interests, loosely, are statistical learning, randomized experimentation,
crowdsourcing, and biomedical applications. Please choose among the keywords below to sort by topic.
For a list of my citations, visit my Google Scholar page or my ResearchGate page.
statistical methodology
statistical "machine" learning
bayesian non-parametric learning
visualization
crowdsourcing
natural language processing
medical applications
medical theories
Kapelner, A. & Krieger, A. (2014) Matching on-the-fly in Sequential Experiments for Higher Power and Efficiency. Biometrics, 70 (2), 378 - 388
(journal page)
(free PDF)
Imagine you are running a sequential experiment measuring the difference between a
treatment and control condition (e.g. a pill for blood pressure via a clinical trial
or testing user behavior in an Internet-based experiment). You can match similar
subjects together on-the-fly (as they arrive) to achieve higher power and efficiency
in the experimental results.
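To make the idea concrete, here is a minimal sketch of sequential matching. It is an illustration under stated assumptions, not the procedure from the paper: it assumes Euclidean distance between covariate vectors, a fixed closeness `threshold`, and a coin flip for unmatched subjects. The names `assign_sequentially`, `reservoir` and `threshold` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def assign_sequentially(
    subjects: Sequence[Sequence[float]], threshold: float, seed: int = 0
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Assign 'T' or 'C' to subjects in arrival order, pairing close ones."""
    rng = random.Random(seed)
    reservoir = []    # unmatched subjects so far: (index, covariates, arm)
    assignments = []  # arm given to each subject, in arrival order
    pairs = []        # (earlier index, later index) for each matched pair
    for i, x in enumerate(subjects):
        # Find the closest unmatched subject, if any (assumed Euclidean distance).
        nearest = min(reservoir, key=lambda r: math.dist(r[1], x), default=None)
        if nearest is not None and math.dist(nearest[1], x) <= threshold:
            # Close enough: form a pair and give the newcomer the opposite arm.
            reservoir.remove(nearest)
            arm = "C" if nearest[2] == "T" else "T"
            pairs.append((nearest[0], i))
        else:
            # No good match yet: randomize and wait for a future match.
            arm = rng.choice(["T", "C"])
            reservoir.append((i, x, arm))
        assignments.append(arm)
    return assignments, pairs


subjects = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.0), (10.0, 10.2)]
assignments, pairs = assign_sequentially(subjects, threshold=1.0)
```

In this toy run the third arrival is paired with the first and the fourth with the second, so each pair contains one treated and one control subject.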
Kapelner, A., Bleich, J., Cohen, Z. D., DeRubeis, R. J. & Berk, R. A. (2014) Inference for Treatment Regime Models in Personalized Medicine. submitted to Biometrics
(free PDF)
Imagine you are a medical practitioner treating a disease by prescribing
one of two possible drugs. Which drug do you assign to patients? Is your
special assignment procedure beneficial versus a naive random assignment?
How much better is it, and is the improvement statistically significant?
Kapelner, A. & Vorsanger, M. (2014) Starvation of Cancer via Induced Ketogenesis and Severe Hypoglycemia. in press, Medical Hypotheses
(journal page)
(free PDF)
It is well known that cancer cells are solely dependent on glucose as their substrate
for metabolism and they are not able to utilize other fuel sources such as ketones and
fatty acids. It is also known that humans under heavy ketosis do not experience symptoms
of hypoglycemia. In our proposal for cancer therapy, we marry these two ideas over the
long term.
Kapelner, A. & Bleich, J. (2014) Prediction with Missing Data via Bayesian Additive Regression Trees. accepted, Canadian Journal of Statistics
(free PDF)
We develop an extension to Bayesian Additive Regression Trees (a new
procedure for statistically learning a non-parametric functional relationship
between a set of input variables X and a response variable y). In the extension,
we incorporate missing data without the need for imputation. Simulations using
real data and generated models demonstrate high performance and stability over
competitors.
Bleich, J., Kapelner, A., George, E. I. & Jensen, S. T. (2014) Variable Selection for BART: An Application to Gene Regulation. Annals of Applied Statistics, 8(3): 1750-1781
(journal page)
(free PDF)
We adapt Bayesian Additive Regression Trees (a new
procedure for statistically learning a non-parametric functional relationship
between a set of input variables X and a response variable y). This adaptation can
select "important" variables from the set of x's that affect y by employing
a principled permutation-based inference procedure. We can also incorporate prior
information about which variables are thought to be important before looking at the data.
Goldstein, A., Kapelner, A., Bleich, J. & Pitkin, E. (2014) Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation. in press, Journal of Computational & Graphical Statistics
(journal page)
(free PDF)
We develop a tool for visualizing the model estimated by any supervised "machine" learning algorithm. We
plot the variation of the fitted values across the range of a covariate of interest for all cases. These
lines suggest where and to what extent heterogeneities exist between cases. We also include a visual test
for model additivity in any covariate.
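As a rough illustration of the curves described above (a minimal sketch, not the paper's software): for each case, the covariate of interest is swept over a grid while that case's other covariates stay at their observed values, and the model's prediction is recorded at each grid point. The names `ice_curves` and `toy_model` are hypothetical, and the toy model stands in for any fitted supervised learner.

```python
from typing import Callable, List, Sequence


def ice_curves(
    predict: Callable[[List[float]], float],
    cases: Sequence[Sequence[float]],
    feature: int,
    grid: Sequence[float],
) -> List[List[float]]:
    """One curve per case: predictions as `feature` sweeps over `grid`,
    with that case's other covariates held at their observed values."""
    curves = []
    for case in cases:
        curve = []
        for value in grid:
            modified = list(case)      # copy so the original case is untouched
            modified[feature] = value  # overwrite only the covariate of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def toy_model(x: List[float]) -> float:
    # Hypothetical fitted model with an interaction between x[0] and x[1].
    return x[0] * x[1]


cases = [[0.0, 1.0], [0.0, -1.0]]
grid = [0.0, 1.0, 2.0]
curves = ice_curves(toy_model, cases, feature=0, grid=grid)

# Averaging the curves pointwise gives the usual partial dependence curve.
average_curve = [sum(column) / len(curves) for column in zip(*curves)]
```

Here the two curves slope in opposite directions, so their average is flat; plotting the individual curves shows the heterogeneity that the average hides.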
Kapelner, A. & Bleich, J. (2014) bartMachine: A Powerful Tool for Machine Learning. accepted, Journal of Statistical Software
(free PDF)
We present a new R package implementation of Bayesian Additive Regression Trees (a new
procedure for statistically learning a non-parametric functional relationship
between a set of input variables X and a response variable y). The package introduces
many new features for data analysis using BART such as variable selection, interaction
detection, model diagnostic plots, parallelization, incorporation of missing data and the ability to
save trees for future prediction.
Bleich, J. & Kapelner, A. (2014) Bayesian Additive Regression Trees With Parametric Models of Heteroskedasticity. in revision, Bayesian Analysis
(free PDF)
We adapt Bayesian Additive Regression Trees (a new
procedure for statistically learning a non-parametric functional relationship
between a set of input variables X and a response variable y). This adaptation
incorporates heteroskedasticity into the model by modeling the form of
heteroskedasticity as a linear model of another set of covariates (which may or may
not be X). In simulations, we demonstrate a reduction in overfitting and
more appropriate predictive intervals than homoskedastic BART.
Berk, R. A., Bleich, J., Kapelner, A., Henderson, J. & Kurtz, E. (2014) Using Regression Kernels to Forecast A Failure to Appear in Court. submitted to Journal of Quantitative Criminology
We develop an implementation of principal components logistic
regression using a novel three-split procedure. The first split
trains the kernel models from a set of many kernels, the second
picks the model that achieves low error while respecting the desired error costs,
and the third gives an honest assessment of future performance.
Our implementation contains an R package and our paper applies
these methods to forecasting failures to appear in court.
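The three-split logic can be sketched generically as follows. This is a simplified illustration, not the R package: two toy classifiers stand in for the kernel models, the 50/25/25 split sizes are an assumption, and "error costs" is assumed to mean different penalties for false negatives and false positives. All function names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Classifier = Callable[[float], int]
Fitter = Callable[[List[float], List[int]], Classifier]


def three_way_split(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 and cut them into train / select / test (50/25/25)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    a, b = n // 2, n // 2 + n // 4
    return indices[:a], indices[a:b], indices[b:]


def mean_cost(y_true: Sequence[int], y_pred: Sequence[int],
              fn_cost: float, fp_cost: float) -> float:
    """Average misclassification cost with asymmetric penalties."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        if y == 1 and p == 0:
            total += fn_cost  # missed a true positive
        elif y == 0 and p == 1:
            total += fp_cost  # raised a false alarm
    return total / len(y_true)


def fit_majority(xs: List[float], ys: List[int]) -> Classifier:
    majority = 1 if 2 * sum(ys) >= len(ys) else 0
    return lambda x: majority


def fit_midpoint(xs: List[float], ys: List[int]) -> Classifier:
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:  # degenerate training split: fall back
        return fit_majority(xs, ys)
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def select_and_assess(candidates: Dict[str, Fitter], xs: List[float], ys: List[int],
                      fn_cost: float, fp_cost: float, seed: int = 0) -> Tuple[str, float]:
    train, select, test = three_way_split(len(xs), seed)
    # Split 1: fit every candidate on the training indices only.
    fitted = {name: fit([xs[i] for i in train], [ys[i] for i in train])
              for name, fit in candidates.items()}

    def cost(model: Classifier, idx: List[int]) -> float:
        return mean_cost([ys[i] for i in idx], [model(xs[i]) for i in idx],
                         fn_cost, fp_cost)

    # Split 2: pick the candidate with the lowest cost on the selection indices.
    best = min(fitted, key=lambda name: cost(fitted[name], select))
    # Split 3: report the cost on data used for neither fitting nor selection.
    return best, cost(fitted[best], test)


xs = [float(i) for i in range(40)]
ys = [1 if x >= 20 else 0 for x in xs]
best, honest_cost = select_and_assess(
    {"majority": fit_majority, "midpoint": fit_midpoint}, xs, ys,
    fn_cost=2.0, fp_cost=1.0)
```

Because the third split plays no role in fitting or selection, the cost computed on it is not biased by the selection step.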
Chandler, D. & Kapelner, A. (2013) Breaking Monotony with Meaning: Motivation in Crowdsourcing Markets. Journal of Economic Behavior & Organization, 90: 123-133
(journal page)
(free PDF)
We conducted the first natural field experiment to explore the
relationship between the "meaningfulness" of a task and worker
effort, measured on three scales. We employed ~2500 workers from
Amazon's Mechanical Turk, an online labor market, to label medical
images. We manipulated the task to exhibit three conditions: "meaningful,"
"zero-context" and "shredded." We found that the meaningful treatment
increased the labor supply and the shredded treatment decreased the
quality of the labor.
Schwartz, H. A., Eichstaedt, J., Blanco, E., Agrawal, M., Dziurzynski, L., Kern, M. L., Kapelner, A., Park, G., Jha, S., Stillwell, D., Kosinski, M. & Ungar, L. H. (2014) Predicting People's Well-Being in Social Media: Multi-level message and user models of language use. working paper
We present the task of predicting well-being, as measured by the "satisfaction with life scale."
Using Amazon's Mechanical Turk, we created a training set of rated textual examples. We then used
machine learning to build a high-performance model and, in addition, identified textual features that
characterize well-being.
Kapelner, A., Kaliannan, K., Schwartz, H. A., Ungar, L. H. & Foster, D. P. (2012) New Insights from Coarse Word Sense Disambiguation in the Crowd. COLING
(journal page)
We use crowdsourcing to disambiguate 1000 words from among coarse-grained senses.
Using regression, we find surprising features which drive differential WSD accuracy:
(a) the number of rephrasings within a sense definition is associated with higher accuracy;
(b) as word frequency increases, accuracy decreases even if the number of senses is
kept constant; and (c) spending more time is associated with a decrease in accuracy.
Kapelner, A. & Chandler, D. (2010) Preventing Satisficing in Online Surveys. Proceedings of CrowdConf
(journal page)
We examine the prevalence of satisficing (mental shortcuts / cheating) in survey tasks on Amazon's
Mechanical Turk. We present a question-presentation
method of fading in survey questions and answers one-by-one, called "Kapcha," which we
experimentally demonstrate to reduce satisficing, thereby improving the quality of survey
results.
Chang, A. Y., Bhattacharya, N., Mu, J., Setiadi, A. F., Carcamo-Cavazos, V., Lee, G. H., Simons, D. L., Yadegarynia, S., Hemati, K., Kapelner, A., Zheng, M., Krag, D. N., Schwartz, E. J., Chen, D. Z. & Lee, P. P. (2013) Spatial organization of dendritic cells within tumor draining lymph nodes impacts clinical outcome in breast cancer patients. Journal of Translational Medicine, 11(1): 242
(journal page)
We describe the spatial organization of dendritic cells within tumor-draining lymph nodes using the software gemident.
We then describe the spatial organization's association with survival outcome in cancer patients. We also characterize
specific changes in number, size, maturity, and T-cell co-localization of such clusters.
Setiadi, A. F., Ray, N. C., Kohrt, H. E., Kapelner, A., Carcamo-Cavazos, V., Levic, E. B., Yadegarynia, S., van der Loos, C. M., Schwartz, E. J., Holmes, S. & Lee, P. P. (2010) Quantitative, architectural analysis of immune cell subsets in tumor-draining lymph nodes from breast cancer patients and healthy lymph nodes. PLoS ONE, 5(8): e12420
(journal page)
We present a novel, quantitative image analysis approach incorporating 1) multi-color
tissue staining, 2) high-resolution, automated whole-section imaging, 3) the use of the "gemident" image analysis software to identify cell
types and locations, and 4) spatial statistical analysis. We apply our integrative approach to compare the
architectural patterns of T and B cells within tumor-draining lymph nodes from breast cancer patients versus healthy lymph
nodes. We found that the spatial grouping patterns of T and B cells differed between healthy and breast cancer lymph
nodes, and this could be attributed to the lack of B cell localization in the extrafollicular region of the tumor-draining lymph nodes.
Holmes, S., Kapelner, A. & Lee, P. P. (2009) An interactive java statistical image segmentation system: Gemident. Journal of Statistical Software, 30(10): 1-20
(journal page)
We present a novel object identification algorithm developed in Java which locates objects of
interest in images. Here, we apply the system to finding cells in images of immunohistochemically stained
lymph node tissue. The success of the method depends heavily on the use of color,
the relative homogeneity of object appearance, the user's input, and the coupled statistical learning
algorithm, random forests. Our system enables iterative improvements
to the classification over many correction cycles, resulting in a highly accurate and
user-friendly system.
Kapelner, A., Lee, P. P. & Holmes, S. (2007) An interactive statistical image segmentation and visualization system. in proceedings of IEEE, Medical Information Visualisation
(journal page)
(free PDF)
Supervised learning can be used to segment
regions of interest in images by making use of color and morphological
information. We developed a novel object identification algorithm in Java which locates
phenotypes of interest in images. Our main innovation is interactive feature
extraction from color images by using sums over color similarities (as measured
by the Mahalanobis distance) at various radii. These features are then fed into
a statistical learning algorithm to classify pixels belonging to phenotypes of interest.