<h1>think saucetech notes</h1>
<p><em>Will · <a href="https://willguxy.github.io/">https://willguxy.github.io/</a></em></p>
<h1 id="best-practice-for-r-in-docker">Best practice for R in Docker</h1>
<p><em>2018-04-11 · <a href="https://willguxy.github.io/2018/04/11/best-practice-for-r-in-docker">https://willguxy.github.io/2018/04/11/best-practice-for-r-in-docker</a></em></p>
<h2 id="docker">Docker</h2>
<p>I’ve talked about this topic in the past but decided to write another post with some practical ideas. To recap, <code class="highlighter-rouge">Docker</code> is containerization software: a program that provisions and executes OS-level virtualized environments. It runs within a system and gives the user access to independent development and run-time environments. For individual applications, <code class="highlighter-rouge">Docker</code> provides a VM-like solution, isolated from the host OS, that packages them up with their dependencies so they can be “shipped” together. It makes application deployment very easy with just a few command lines. Compared with server-level configuration, <code class="highlighter-rouge">Docker</code> allows multiple applications to run on the same server without affecting each other, potentially communicating through the loopback network without having to expose ports to other servers. Resource management also gets simpler: <code class="highlighter-rouge">Docker</code> can be used to build a cluster on a single server by running multiple containers, and scaling is made easy with <code class="highlighter-rouge">docker-compose</code>. Server start-up and shut-down are usually much slower than bringing a container up or down, and provisioning a container is much simpler than provisioning a server, which makes <code class="highlighter-rouge">Docker</code> perfect for scaling-type work.</p>
<h2 id="containers-for-r">Containers for R</h2>
<p>My trials with R containers from Docker Hub have been futile. The reason is that R doesn’t have many dependencies to begin with, especially with the <code class="highlighter-rouge">Ubuntu</code> base image. All you need is to install <code class="highlighter-rouge">r-base</code> and <code class="highlighter-rouge">r-base-dev</code>, although on other base OSes you might need a bit of work to install R. I mentioned last time that one way to install R packages is <code class="highlighter-rouge">Rscript -e "install.packages('your_awesome_package_name')"</code>, but just like Python, a lot of R packages are built with C and rely on certain system libraries. It’s easy to install those, but sometimes hard to wrap your head around what’s really needed without spending some time on Google.</p>
<p>An alternative, and probably better, way is to install these packages with <code class="highlighter-rouge">apt-get</code>. Many common R packages follow the same naming convention, <code class="highlighter-rouge">r-cran-xxx</code>: for example, <code class="highlighter-rouge">r-cran-ggplot2</code> and <code class="highlighter-rouge">r-cran-rpostgresql</code>. This is better because the maintainers of these <code class="highlighter-rouge">apt</code> packages also specify dependencies, so you don’t need to know that <code class="highlighter-rouge">RPostgreSQL</code> relies on <code class="highlighter-rouge">libpq-dev</code> – <code class="highlighter-rouge">apt-get</code> installs that for you.</p>
<p>Of course there are still a few exceptions. For example, the very awesome <code class="highlighter-rouge">data.table</code> package is not in <code class="highlighter-rouge">apt</code> as of now, so you’d need a bit of work to install it separately. What you could do, as I suggested last time, is mimic what Python does for dependencies. You can create a file called <code class="highlighter-rouge">requirements.txt</code> that goes</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data.table
</code></pre></div></div>
<p>and then you can have another R script called <code class="highlighter-rouge">install.R</code> with your favorite repo mirror server</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># read one package name per line from requirements.txt
pkgs = read.csv('requirements.txt', header=FALSE, stringsAsFactors=FALSE)$V1
# install everything in one call against your favorite mirror
install.packages(pkgs, repos='http://cloud.r-project.org/')
</code></pre></div></div>
<p>and finally your <code class="highlighter-rouge">Dockerfile</code> can go like</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM ubuntu:16.04
RUN apt-get update \
 && apt-get upgrade -yqq \
 && apt-get install -yqq --fix-missing \
        bsd-mailx \
        r-base r-base-dev \
        r-cran-rpostgresql r-cran-magrittr r-cran-gridextra \
 && apt-get autoremove -yqq
ADD requirements.txt /tmp/requirements.txt
ADD install.R /tmp/install.R
RUN cd /tmp && Rscript install.R
# some other stuff
</code></pre></div></div>
<p>Of course you can just throw the R package installation step straight into the <code class="highlighter-rouge">Dockerfile</code>, but I feel this way is clearer, and you can squash the whole installation into one step, which reduces unnecessary layers when building the image.</p>
<p>Until next time.</p>
<h1 id="trading-system-with-python-and-redis">Trading System with Python and Redis (toy model)</h1>
<p><em>2018-03-06 · <a href="https://willguxy.github.io/2018/03/06/trading-system-with-python-and-redis">https://willguxy.github.io/2018/03/06/trading-system-with-python-and-redis</a></em></p>
<h2 id="basic-idea">Basic Idea</h2>
<p>Having grown more interested in trading system design lately, I want to share what I’ve been thinking and implementing so far, as a first step into the grand regime of the quant trading world. In a previous post, I said that simple models are often better models. That usually makes implementing the model less of a big deal; building a robust and scalable trading system is the more fun part. In this post, I’ll demonstrate with simple examples as a proof of concept, and illustrate my thought process for building a trading system from the ground up.</p>
<p>The basic components of a trading system include, but are not limited to: raw data feed, data ETL, signal generation, order generation, order execution, account management, risk management, a P&L component, data persistence, etc. Some of these components can be thought of as multiple cascades of processing, with potential branches if, say, we have multiple signals. One modern way of thinking about a real-time system is event-driven programming. At the occurrence of an event, a process gets triggered and its output is passed on to the subsequent processes; at other times, these processes remain idle. This naturally creates an asynchronous pipeline, whose throughput is limited by the slowest component. The good news is that as long as all events/data are somewhat independent and can be processed in parallel, we can always throw more “workers” at the bottleneck component, as shown in the sketch below. As long as the most limited component has higher throughput than the original input, the system will not see any congestion.</p>
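<p>One caveat about throwing in more workers: plain Redis pub/sub fans each message out to every subscriber, so to split work among several identical workers, a list-based queue is the usual trick. Here’s a minimal sketch of that idea, with <code class="highlighter-rouge">process()</code> standing in for a hypothetical per-item handler:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import redis

cache = redis.StrictRedis(host='localhost', port=6379)

# producer: push work items onto a list instead of publishing them
cache.lpush('work_queue', '{"symbol": "BTC-USD", "price": 10000.0}')

# each of N identical worker processes runs this same loop; Redis hands
# every item to exactly one of them, which load-balances the bottleneck
while True:
    _, item = cache.brpop('work_queue')
    process(item)  # hypothetical handler for one work item
</code></pre></div></div>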
<p>The communication between components is mostly pre-defined structured data. Hence, it naturally makes sense to have a unified API between all components. This way, it is a lot easier to build things just against the API: all you need to know is how to connect to the standard API, what data you are getting, and what kind of data you should be sending. The data you receive is determined by the upstream components and agreed upon globally, schema-wise, by all parties; likewise for the data you send out.</p>
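<p>For instance, the shared schema could be as simple as a JSON document that every component knows how to parse; the field names below are illustrative, not a fixed standard:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import time

def make_tick(symbol, price, size):
    """Serialize one market tick in the agreed internal schema."""
    return json.dumps({
        "type": "tick",
        "symbol": symbol,
        "price": price,
        "size": size,
        "ts": time.time(),
    })

msg = make_tick("BTC-USD", 10000.0, 0.5)
tick = json.loads(msg)  # every downstream component parses it the same way
</code></pre></div></div>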
<p>The above pattern is called the publisher-subscriber pattern. You might’ve heard of the observer/observable pattern, which is somewhat similar. The subtle difference is the exact reason why Redis is used here. The pub/sub pattern is composed of two relatively independent components, whose message flow is entirely controlled by the central message bus. These two components don’t necessarily know of each other’s existence: the publisher can keep publishing “into the air”, while the subscriber can subscribe to something that has no news. On the other hand, the subject in the observer pattern has to maintain the event loop and keep track of all its observers, and observers are likewise aware of the subject’s existence, hooking themselves up to it. Pub/sub is looser in terms of component coupling, but requires a central message bus in the implementation.</p>
<h2 id="system-structure">System Structure</h2>
<p>Usually a message bus is a pain in the butt. With Redis, we have a lightweight central message bus with very limited functionality compared with full-fledged message buses (Kafka, RabbitMQ, ActiveMQ, ZeroMQ, etc.), but enough for a toy model. The basic pattern for each component is (in Python):</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import redis

cache = redis.StrictRedis(...)
p = cache.pubsub()
p.subscribe('some_topic')
for msg in p.listen():
    # each msg is a dict with keys: type, pattern, channel, data;
    # the first one is a subscribe confirmation, so check the type
    if msg['type'] != 'message':
        continue
    # do something with msg['data'], then pass the result downstream;
    # note that publish() lives on the client, not the pubsub object
    cache.publish('some_other_topic', some_new_data)
</code></pre></div></div>
<p>The for-loop is an infinite loop: it blocks rather than exits when there’s no new message on the <code class="highlighter-rouge">some_topic</code> channel it subscribes to. Now you can use the same pattern to write a few components and generate orders. Note that the raw data feed also needs to run continuously. One can make intermittent calls to a REST API, along the lines of this pseudo-code:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

while True:
    data = call_rest_api()  # hypothetical helper wrapping the HTTP request
    cache.publish('some_topic', data)
    time.sleep(1)
</code></pre></div></div>
<p>However, the data is only available when we call the API for it, and the overhead of the HTTP request/response cycle leads to higher latency. In quant trading, we want to be more real-time and have less latency, so other data transfer channels are preferred, such as a WebSocket connection (which a lot of exchanges provide). Better still, the socket connection is naturally event-driven and integrates seamlessly into our existing pipeline. So a better way of implementing the raw data feed is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># using the websocket-client package as one possible implementation
import websocket

data_feed = websocket.create_connection(some_url)  # plus any authentication
while True:
    msg = data_feed.recv()
    # turn raw data into the internal data schema, then publish
    cache.publish('some_topic', transform(msg))  # transform() is hypothetical
</code></pre></div></div>
<p>Simple as this. Until next time.</p>
<h1 id="r-the-docker-way">R the Docker way</h1>
<p><em>2018-03-02 · <a href="https://willguxy.github.io/2018/03/02/r-the-docker-way">https://willguxy.github.io/2018/03/02/r-the-docker-way</a></em></p>
<h2 id="docker">Docker</h2>
<p>The idea of containerized applications/processes has been thriving for the past few years. Basically it’s a separation of concerns – you start from scratch, install the things you need for your application, and only need to worry about the local environment within the container. You can build your Docker image (the VM) incrementally, with version control. Deployment is made easy, and all sorts of applications can co-exist on one host machine without you having to worry about dependencies and whatnot.</p>
<p>Of course, container management can be challenging once you scale up, and there are tools on top (like Kubernetes) to facilitate the process. People have traditionally favored dynamic linking, and this sort of goes back to the origin of Docker. But version conflicts are almost inevitable with dynamic linking. Python, for example, is a type of VM as well (virtualenv tells the truth). But people tend to favor Docker over virtual environments nowadays, because Docker is a lot easier to manage and will never mess up your host machine.</p>
<p>Other people may favor static linking, which also makes deployment easy, and you don’t have to worry about version problems! You don’t need all the overhead of a VM, which can be huge compared with a lightweight application (Docker images of several gigabytes aren’t rare). After all, do you really need all those Ubuntu systems flying around when your application just sends you a daily reminder to drink more water?</p>
<h2 id="r">R</h2>
<p>Enough on Docker. I’ve been hearing all kinds of different opinions on R and Python – the best-known rivals in the data science world. The reality is, which one suits you better depends on your own level of knowledge and your goal in coding something up. R, in my opinion, leans toward functional programming, whereas Python is more object-oriented. You may argue that each can do the other, but the fundamental philosophy is indeed different. You can tell from the very common <code class="highlighter-rouge">head</code> function in R. It’s generic and gives you the first few elements of any iterable data structure, while in Python the equivalent would be a class method. Of course, Python has <code class="highlighter-rouge">len</code>, <code class="highlighter-rouge">map</code>, <code class="highlighter-rouge">reduce</code>, etc., but that’s really not the focus of Python. In R, this sort of function is everywhere.</p>
<p>That makes R a language for people who are familiar with the concepts and don’t care too much about implementation (like people with a stats background). A sloppy namespace is OK. Load whatever package you need. If you are thinking about autocorrelation, fine, just call <code class="highlighter-rouge">acf</code>, and boom, magic. All top level. For data work, you always think about the data while coding things up. Simple as that. Of course, the learning curve can be a bit steep, but once you master a few R packages, you can stick with them for the rest of your life. Python, on the other hand, needs a bit more specification in terms of where a function gets loaded from, and the Pandas DataFrame is unfortunately verbose compared with R’s <code class="highlighter-rouge">data.table</code> and friends.</p>
<p>However, Python is a lot more handy when you develop some sort of process. The coding is more systematic, and it works well with all sorts of resources, e.g. <code class="highlighter-rouge">Flask</code> as a web framework. R can do some of this, but is not really designed for it.</p>
<h2 id="r-for-docker">R for Docker</h2>
<p>Although R is not designed for full-fledged applications, it can be really useful in data science, because a lot of the work is just data ETL (Extract, Transform, and Load). R is succinct and efficient. An R library is either compiled C code or just more R scripts. But dependencies can still be an issue, especially when you have all sorts of R processes, and Docker in this case comes in handy. I originally started with a pure Ubuntu image and tried to install R while building it, but with more understanding of Docker, I realized that one should use existing Docker images for stability and compatibility (such as r-base on Docker Hub).</p>
<p>Installing packages can be a bit cumbersome. Having tens of lines of <code class="highlighter-rouge">RUN Rscript -e "install.packages('ggplot2')"</code> in your Dockerfile is far from ideal. I’d suggest adopting the Python way: a <code class="highlighter-rouge">requirementsR.txt</code> with package names, and a simple R script to load this file and install the packages. Some R packages have dependencies which need to be installed with <code class="highlighter-rouge">apt-get</code>, but some Python modules have the same issue when you install through <code class="highlighter-rouge">pip</code>.</p>
<p>With Python it’s naturally intuitive to use separate files/modules for different parts of the code: better reusability, and things are put into perspective. With R, I originally tried to put all of my code into functions and just throw all the function calls into a main function. But over time, I’ve found that a lot of ETL work is simply hard to reuse. A lot of it is data-specific and requires you to customize your code a little bit. Even though some things are done in almost every ETL pipeline (like filling/removing missing data), each is only one or two lines; there’s no real point in wrapping them in a function and trying to be generic. Therefore, most of my R scripts are just straight top-to-bottom code, with few function calls and very few indents.</p>
<p>It’s helpful, though, to have some data validation steps, intermediate output, and <code class="highlighter-rouge">tryCatch</code>, so that debugging is a lot easier.</p>
<h2 id="summary">Summary</h2>
<p>Overall I think using containerized R processes for data ETL work is great. With Kubernetes, you can even schedule the containers and make the pipeline very efficient.</p>
<h1 id="strategy-backtest-overfitting">Overfitting problem in quantitative trading strategies</h1>
<p><em>2018-02-28 · <a href="https://willguxy.github.io/2018/02/28/strategy-backtest-overfitting">https://willguxy.github.io/2018/02/28/strategy-backtest-overfitting</a></em></p>
<p>The problem of overfitting is ubiquitous. In the world of quantitative trading, its impact can be devastating. On the modest side, the signal degrades very fast; on the extreme side, you lose money. Overfitting is most severe when you optimize your model parameters by maximizing some kind of objective function – total gain, Sharpe ratio, minimum drawdown, etc. – over the historical price time series. After all, by looping through the same price series enough times, you will ultimately end up with some strategy that works perfectly. But out-of-sample performance is doomed to be bad-looking. Things get even worse when the model has all sorts of tuning parameters.</p>
<p>By the way, the famous <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">“curse of dimensionality”</a> already tells us that the amount of data needed for robust prediction grows exponentially with the dimension. So, simple strategies are often better strategies. Anyway, even with a simple model, my personal rule of thumb is: if the optimized value is overly sensitive to the choice of parameter around its backtested “optimal”, it’s probably no good. On the other hand, if jiggled parameters lead to similar results, the parameters are more likely to hold up.</p>
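<p>A minimal sketch of that rule of thumb, assuming you already have some <code class="highlighter-rouge">backtest()</code> function that maps a parameter value to a score such as the Sharpe ratio:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def neighborhood_scores(backtest, best_param, rel_step=0.1, n=20, seed=0):
    """Re-run the backtest at parameters jiggled around the optimum.

    If the returned scores stay close to backtest(best_param), the optimum
    sits on a plateau; if they collapse, it is likely an overfit spike.
    """
    rng = np.random.default_rng(seed)
    jiggled = best_param * (1 + rel_step * rng.uniform(-1, 1, size=n))
    return np.array([backtest(p) for p in jiggled])
</code></pre></div></div>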
<p>We also know that cross-validation is an awesome way to reduce the risk of overfitting; I’ll probably talk about how it can be used in quantitative strategies in another post. For now, I’d like to introduce you to an awesome <a href="http://epchan.blogspot.com/2017/11/optimizing-trading-strategies-without.html">blog post</a> by Ernest Chan, the author of a few well-known quant trading books, about their stab at the overfitting problem. In simple terms, one can backtest on simulated price series and optimize over all possible outcomes. The simulated prices come from a good time series model of the price history, which probably has to have some exploitable characteristics – after all, one cannot have a winning strategy over pure random walks. This approach is kind of like data augmentation, and I’d argue that time-series-model-based simulation is much better than resampling; the latter is used widely in Monte Carlo simulation (according to my risk management professor).</p>
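<p>As a rough sketch of the simulated-series idea (the model choice here is mine, for illustration only): fit some time series model to the price history, generate many paths, and optimize against the whole distribution of outcomes. An AR(1) on log-prices is about the simplest model with exploitable structure, since <code class="highlighter-rouge">phi = 1</code> would reduce it to a pure random walk:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def simulate_ar1_logprices(phi, mu, sigma, s0, n_steps, n_paths, seed=0):
    """Simulate price paths whose log follows an AR(1) around mu."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, np.log(s0))
    paths = np.empty((n_paths, n_steps))
    for t in range(n_steps):
        x = mu + phi * (x - mu) + rng.normal(0.0, sigma, n_paths)
        paths[:, t] = np.exp(x)
    return paths  # shape (n_paths, n_steps): backtest on each row
</code></pre></div></div>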
<p>This of course relies heavily on a good (and stable) time series model of the price history. I cannot speak to whether that’s realistic, since I’m not an expert on time series. The blog post also mentions two papers, by <a href="https://arxiv.org/pdf/1411.5062.pdf">Leung</a> and <a href="https://arxiv.org/pdf/1408.1159.pdf">Karr</a> respectively, on this subject. The Karr paper is said to be similar to what Ernest described in his blog post, although he claimed they discovered the approach independently.</p>
<p>Also as a heads-up, I’ll probably write another post on my recently developed trading system based on Python and Redis. This system is designed to run both in real time and for backtesting, so that both modes utilize the same infrastructure and code base (see the sketch below). The system is also designed for scalability, in the sense of adding more assets, exchanges, and components, though performance would probably hit a bottleneck at higher frequencies and with more complex processes. I’ve also started learning node.js, which I’ve been told good things about.</p>
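<p>A minimal sketch of how one code base can serve both modes (helper names are hypothetical): downstream components consume a single generator interface, regardless of whether messages come from the live Redis subscription or a recorded file:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json
import time

def live_feed(pubsub):
    """Yield parsed messages from a live redis-py PubSub subscription."""
    for msg in pubsub.listen():
        if msg['type'] == 'message':
            yield json.loads(msg['data'])

def replay_feed(path, delay=0.0):
    """Yield recorded messages from a file of JSON lines, optionally
    throttled, so backtests exercise the same downstream code as live runs."""
    with open(path) as f:
        for line in f:
            if delay:
                time.sleep(delay)
            yield json.loads(line)
</code></pre></div></div>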
<p>Until next time.</p>
<h1 id="coding-appearance-setup">Terminal color theme setup on Mac/Ubuntu</h1>
<p><em>2017-12-16 · <a href="https://willguxy.github.io/2017/12/16/coding-appearance-setup">https://willguxy.github.io/2017/12/16/coding-appearance-setup</a></em></p>
<p>Aesthetics are subjective, but some things are widely considered more “beautiful” than others. Because of both my job and my personal interests, I spend most of my time in a terminal on Mac or Ubuntu. From basic bash scripts to the R and Python I use at work, everything goes through the terminal window and Vim. Even my personal blog is hosted on GitHub, and I edit all the markdown in Vim.
Since it’s something I face every day, I naturally want a nice-looking interface that keeps me happy.</p>
<p>So in this post, I’ll summarize the setup tricks and tools I’ve been using to make my interface prettier.</p>
<p>First, ditch the default terminals on Ubuntu and Mac in favor of Terminator (Ubuntu) or iTerm2 (Mac). Terminator can be installed directly through apt. Version-wise I’m on 0.98, which hasn’t been updated in a long time. I’ve seen someone fork an earlier version and keep it updated, but it hasn’t made it into Ubuntu’s default PPA list yet. For now 0.98 is good enough; I haven’t hit any major problems on Ubuntu 16.04. As for iTerm2, a quick search will turn up how to download and install it.</p>
<p>The remaining steps are basically the same on both systems. Once installed, replace the default bash with zsh, ideally <a href="https://github.com/robbyrussell/oh-my-zsh">oh-my-zsh</a>. If you’re worried about the differences between zsh and bash, search online for the possible caveats; for my own work I haven’t noticed any significant difference. After installation, you can change the theme, plugins, and so on in <code class="highlighter-rouge">$HOME/.zshrc</code>. Also check that your <code class="highlighter-rouge">$PATH</code> is still correct, along with your aliases. I chose the <code class="highlighter-rouge">ys</code> theme on Mac and <code class="highlighter-rouge">agnoster</code> on Ubuntu. Both look great to me; <code class="highlighter-rouge">agnoster</code> is slightly more compact because it uses fewer lines.</p>
<p>Next comes the color theme. Clone the <a href="https://github.com/mbadolato/iTerm2-Color-Schemes">iterm2-color-theme</a> GitHub repo, which has separate settings for both iTerm2 and Terminator. Following the repo’s documentation, in iTerm2 go to <code class="highlighter-rouge">Settings</code> -> <code class="highlighter-rouge">Profiles</code> -> <code class="highlighter-rouge">Color</code>, choose <code class="highlighter-rouge">Color Presets</code>, and import the <code class="highlighter-rouge">.itermcolors</code> file you want. Terminator is even simpler: just copy the contents of the <code class="highlighter-rouge">.config</code> file for the theme you like into your Terminator config file. Remember to also change the default theme under the layout settings at the same time. On both systems I picked Dracula, a dark theme. There are many more choices in the repo, and I’d encourage you to experiment.</p>
<p>Font choice matters too. Ubuntu’s default Mono font is already decent, but I went with the <a href="https://github.com/powerline/fonts">Powerline</a> font family, whose glyph transitions are smoother. Just follow the instructions in the GitHub repo to install them. Once installed, look for the Powerline fonts in your terminal settings; I picked DejaVu Sans Mono for Powerline.</p>
<p>Finally, for the Vim color theme, just clone the <a href="https://github.com/sjl/badwolf">badwolf</a> repo locally and put the <code class="highlighter-rouge">.vim</code> file from its colors directory into <code class="highlighter-rouge">$HOME/.vim/colors/</code>. You can also create a symbolic link, so that when the repo gets updated, your Vim theme does too. The last step is to add or modify <code class="highlighter-rouge">colorscheme badwolf</code> or <code class="highlighter-rouge">colorscheme goodwolf</code> in <code class="highlighter-rouge">$HOME/.vimrc</code>; badwolf is the dark theme and goodwolf is the light one.</p>
<h1 id="github-repos-nlp-sql">A few github repos for natural language to sql queries</h1>
<p><em>2017-11-24 · <a href="https://willguxy.github.io/2017/11/24/github-repos-nlp-sql">https://willguxy.github.io/2017/11/24/github-repos-nlp-sql</a></em></p>
<p>I realize that turning natural language into SQL queries isn’t what NLP is all about, but it’s an interesting sub-problem. It has some constraints – for example, it expects the input to be questions related to data. On the other hand, a database has to exist for your question, even the weirdest ones. To that extent, NLP invoking search engines might be a better solution, with your search engine responsible for making sense of the results. Google apparently has deployed some experimental features like this.</p>
<p>In other scenarios, NLP-to-SQL can actually be quite useful. For example, if you constrain your user base to ask questions about specific data, like “<em>what’s the most expensive restaurant within 25 miles of where I live?</em>” – regardless of whether you really want to put your bucks on those places or are just curious – it’s a lot easier to make sense of, and the accuracy can potentially be pretty good.</p>
<p>Salesforce announced recently that by 2020, they will ship a tool that lets their users query data using natural language. The model? LSTMs, value networks, that kind of thing. I believe they use those machine learning algos for a reason, but for your information, Microsoft used to provide this kind of feature for their MS SQL products back in the day, and discontinued the project later. So the idea is definitely not new. I do believe, though, that Salesforce has done their own research and decided that LSTM is the way to go.</p>
<p>While the above-mentioned methods from Salesforce are pretty much in-house, I’ve found a few GitHub repos that attempt to attack similar problems, if not the same one, which could be good starting points if you decide to probe this problem a bit deeper:</p>
<ul>
<li><a href="https://github.com/FerreroJeremy/ln2sql">ln2sql</a></li>
<li><a href="https://github.com/vqtran/EchoQuery">EchoQuery</a></li>
<li><a href="https://github.com/machinalis/quepy">quepy</a></li>
<li><a href="https://github.com/Anishabhatla281/Natural-Language-to-SQL-Convertor">Natural-Language-to-SQL-Convertor</a></li>
<li><a href="https://github.com/nihit7/NLIDB">NLIDB</a></li>
<li><a href="https://github.com/DukeNLIDB/NLIDB">DukeNLIDB</a></li>
</ul>
<h1 id="wordpress-xml-to-markdown">Turn your wordpress.com blog into Jekyll</h1>
<p><em>2017-11-23 · <a href="https://willguxy.github.io/2017/11/23/wordpress-xml-to-markdown">https://willguxy.github.io/2017/11/23/wordpress-xml-to-markdown</a></em></p>
<p>I found this <a href="https://gist.github.com/brianburridge/d28fd59ecd097c140be2">Github Gist</a> pretty useful. After exporting the XML file from your WordPress admin account, you can call <code class="highlighter-rouge">wordpressxml2jekyll.rb</code> like</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">ruby wordpressxml2jekyll.rb wordpress.xml</code></pre></figure>
<p>Then you should find a folder named <code class="highlighter-rouge">_posts</code> under your current directory. I have to admit it’s not perfect, and you’d be better off using WordPress plugins such as <a href="https://wordpress.org/plugins/jekyll-exporter">Jekyll Exporter</a>. That works if your WordPress blog is fully managed by you, but fails if you are using the free version on wordpress.com.</p>
<p>Trust me, don’t even bother trying the method mentioned on Jekyll’s website – it’s not going to work, and the output markdown files are pretty much trash. <em>pandoc</em> didn’t work for me either.</p>