External control of gene regulatory networks (GRNs) has received much attention in recent years. The aim is to find a sequence of actions to apply to a gene regulation system so that it avoids its diseased states. In this work, we propose a novel method for controlling partially observable GRNs that combines batch mode reinforcement learning (Batch RL) with the TD(lambda) algorithm. Unlike existing studies, which infer a computational model from gene expression data and then derive a control policy over the constructed model, we interpret the time series gene expression data as a sequence of observations produced by the system and obtain an approximate stochastic policy directly from these data, without estimating the internal states of the partially observable environment. We thereby eliminate the two most time-consuming phases of existing approaches: inferring a model and running that model to compute the control. Results show that our method provides control solutions for regulation systems of several thousand genes in seconds, whereas existing studies cannot solve control problems of even a few dozen genes. Results also show that our approximate stochastic policies are almost as good as the policies generated by the existing studies.
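To make the idea of learning a stochastic policy directly from observation sequences concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes discretized (binary) expression profiles as observations, a hypothetical reward that penalizes a diseased profile, a tabular Sarsa(lambda) update (one member of the TD(lambda) family) with replacing eligibility traces swept repeatedly over a fixed batch, and a softmax over the learned values as the stochastic policy. The names `batch_sarsa_lambda` and `softmax_policy`, the toy episodes, and all parameter values are illustrative assumptions.

```python
import math
from collections import defaultdict


def batch_sarsa_lambda(episodes, n_actions, alpha=0.1, gamma=0.9, lam=0.8, sweeps=50):
    """Tabular Sarsa(lambda) swept repeatedly over a fixed batch of episodes.

    episodes: list of episodes, each a list of (observation, action, reward)
    tuples, where reward is the (assumed) payoff received after taking the
    action. Values are indexed by observation, not by hidden state.
    """
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(sweeps):
        for episode in episodes:
            # Eligibility traces are reset at the start of every episode.
            traces = defaultdict(lambda: [0.0] * n_actions)
            for t, (obs, act, reward) in enumerate(episode):
                if t + 1 < len(episode):
                    next_obs, next_act, _ = episode[t + 1]
                    target = reward + gamma * q[next_obs][next_act]
                else:
                    target = reward  # last step of the episode: no bootstrap
                delta = target - q[obs][act]
                traces[obs][act] = 1.0  # replacing trace
                for o, elig in traces.items():
                    for a in range(n_actions):
                        q[o][a] += alpha * delta * elig[a]
                        elig[a] *= gamma * lam
    return q


def softmax_policy(q_values, temperature=1.0):
    """Turn action values for one observation into action probabilities."""
    top = max(q_values)
    exps = [math.exp((v - top) / temperature) for v in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Toy batch (hypothetical): observations are binary expression profiles of
# two genes; action 1 = intervene, action 0 = do nothing.
SICK, HEALTHY = (1, 1), (0, 1)
episodes = [
    [(SICK, 1, 1.0), (HEALTHY, 0, 0.0)],  # intervening left the diseased profile
    [(SICK, 0, -1.0), (SICK, 0, -1.0)],   # no intervention: profile stays diseased
]
q = batch_sarsa_lambda(episodes, n_actions=2)
policy_when_sick = softmax_policy(q[SICK])
print(policy_when_sick)
```

In this toy batch the resulting policy assigns a higher probability to intervening when the diseased profile is observed. No transition model is built at any point: the value table is updated from the recorded observation, action, and reward tuples alone.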