coding beacon

[programming & visualization]


Building a Production Machine Learning Infrastructure

The following is a presentation by Josh Wills. Josh provides a rare, no-nonsense view of the field of data science.

The “Data Science Workflow” Screenshot



Open Source Data Mining Software: Orange

The website describes Orange as follows: “Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.”

I am currently evaluating various front-ends for modeling various project workflows and visual programming is very appealing for this purpose. Orange’s environment looks as follows:

Orange is already included in the list of data mining software I compiled back in 2013.

An Attempt to Organize Datamining Resources

0. R Cheatsheets

1. Choosing Visualization Tools

Three Golden Rules of Visualization Tools

Rule #1: No tool will turn you into a pro.
Rule #2: First learn one single tool very well.
Rule #3: Choose tools you are totally in love with.


The main website by the author, Hadley Wickham:

* Ggplot2 package will give you the most return on the time you invest learning how to use it
A quick reference (cheatsheet) for ggplot2 “Data Visualization”
A short intro/tutorial for ggplot2


The main website:

Ggvis is used “…more for data exploration than data presentation. …ggvis makes many more assumptions about what you’re trying to do: this allows it to be much more concise, at some cost of generality.”
* “Ggvis provides a tree-like structure allowing properties and data to be specified once and inherited by children.”
Ggvis vs Ggplot2
Range selector for ggvis

2. Choosing Tools for Interactivity


The main website:

Shiny simply turns your R into a web server and lets you interact with your data through a browser. See the cheatsheet “Shiny” (also above).
Shiny is OK to start with; however, you might wish to extend it with widgets or whatever fits your needs best.


The main website:


cons: large datasets might need to be uploaded to the client for some widgets

3. Building a Dashboard

Dashboard Theory

Stephen Few

Stephen’s Website
His book “Information Dashboard Design” on Amazon
Why Most Dashboards Fail (pdf)

Dashboards are Dumb (or how we sometimes delude ourselves with fancy dashboards)

The essence in one quote: “The key to usability is the association between appropriate controllers and the individual meters. In a car, the controllers are the steering wheel, the gas pedal, the brake pedal, the ignition switch, and the gearshift, primarily. Generally, there are one or two controllers associated with each meter and the action of each controller is usually proportional to the metric that appears on the meter (e.g. Gas pedal and brake pedal control speed; gas pedal and gear shift control RPM, etc.). There are more controllers on a plane, but the same relationships hold between controllers and meters, at least for older planes.”

Risk Communication Dashboards (pdf)

Nine User Interface Design Patterns

Ten Tips to Design User-Friendly Dashboards

Shiny and GoogleVis



Security Dashboards in Shiny

Dashboard design Using MS Excel *

* In case you have to use Excel, have a look at “Sparklines for Excel” maintained by Fabrice Rimlinger:

4. Managing Your Workflow

A workflow is used to automate repetitive operations you perform on the data. If you generate so much data that it turns into a hard-to-use pile, as happened in my case, you can plan ahead and look at the various tools that suit your needs. I am still a long way from organizing every aspect of the project into a coherent system, but my preliminary survey of available software makes me think that DAWN (see below) is the most flexible, although it also requires the most programming skill. Other tools, such as Rapid Miner or Weka, can be used with the R programming environment almost out of the box.

Rapid Miner (open source) (R is integrated via a standard plugin downloadable from within the software itself)

Dawn Science (open source)

Data Analysis WorkbeNch (DAWN) is an eclipse based workbench for doing scientific data analysis. It implements sophisticated support for the following:
(1) Visualization of data in 1D, 2D and 3D
(2) Python script development, debugging and execution
(3) Processing and Workflows for visual algorithms analyzing scientific data (use the source code & eclipse as the base)

Weka (open source)

How to integrate R into Weka:

Magrittr (R package) (a dependency of the dplyr package)

This R package provides “forward-piping” operators such as %>% (see the “Data Wrangling” cheatsheet above).
Quote from the package description: “The magrittr package offers a set of operators which promote semantics that will improve your code by structuring sequences of data operations left-to-right (as opposed to from the inside and out), avoiding nested function calls, minimizing the need for local variables and function definitions, and making it easy to add steps anywhere in the sequence of operations.”
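The left-to-right idea in that description can be illustrated outside R as well. The following is a minimal Python sketch, not magrittr itself: the `pipe` helper and the two step functions are hypothetical names made up for this illustration. It compares a nested call with the same computation written as a sequence of steps.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass `value` through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def drop_negatives(xs):
    """Keep only the non-negative readings."""
    return [x for x in xs if x >= 0]


def square_all(xs):
    """Square every reading."""
    return [x * x for x in xs]


readings = [3, -1, 4, -5, 2]

# Nested style: has to be read from the inside out.
nested = sum(square_all(drop_negatives(readings)))

# Piped style: reads in the order the operations happen.
piped = pipe(readings, drop_negatives, square_all, sum)

print(nested, piped)
```

Adding a step in the piped form means adding one more argument at the right position, which is the "easy to add steps anywhere" property the quote refers to.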

Other Datamining Software (commercial and open source)

5. Data Mining/Analytics Workflow Theory

Introduction to Data Mining


Understanding Data Analytics Project Life Cycle

6. Useful Quotes from R-Bloggers, Mostly

An Introduction to Statistical Learning with Applications in R (free pdf)
“This book provides an introduction to statistical learning methods. It is aimed for upper level undergraduate students, masters students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real life settings, and should be a valuable resource for a practicing data scientist.”

Elements of Statistical Learning (free pdf)
“The go-to bible for this data scientist and many others is The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Each of the authors is an expert in machine learning / prediction, and in some cases invented the techniques we turn to today to make sense of big data: ensemble learning methods, penalized regression, additive models and nonparametric smoothing, and much much more.”

Machine learning

In-depth introduction to machine learning — 15 hours of expert videos

Free Ebooks on Machine Learning

Why you should learn R first for data science (selected quotes below):

Data wrangling
“It’s often said that 80% of the work in data science is data manipulation. … R has some of the best data management tools you’ll find. The dplyr package in R makes data manipulation easy. … When you “chain” the basic dplyr together, you can dramatically simplify your data manipulation workflow.”

Data visualization
“ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization. … there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.

Moreover, when you combine ggplot2 and dplyr together (using the chaining methodology), finding insight in your data becomes almost effortless.”

Machine learning
“While … most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools … [and] R has some of the best tools and resources.

One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.”

Data Sources

Quandl — free & premium financial market data (think “free Bloomberg in the format you want”)

Over 70 free large data repositories (updated) — a broad range of data (including finance related)

FDF Financial Data Finder

Datasets for Data Mining and Data Science at KDnuggets

Quant Finance Resources at CalTech

Ideas, Bells, and Whistles

Working with Time Series
Graphing Highly Skewed Data
In 4 Steps your Application (including R) is running on a Cloud Computing Cluster
Eight New Ideas From Data Visualization Experts
Hierarchical Clustering with R (featuring D3.js and Shiny)
A Growing List of 20+ Free Ebooks on Datamining
Big Data Made Simple: Feed on Visualization
My collection of visualization and datamining software and libraries

7. Where to Ask for Help

General R questions

#R channel at Freenode (IRC network) — perhaps, the fastest way to get help with R


Shiny at
Shiny Google Group

DataKind — an organization that provides (volunteer) data science services to non-profits around the world



My thanks go to this blog:


Natural Language Processing and Data Mining Links

Data Mining Fusion: Graphing and Charting

Note: My personal choice fell on Shiny, as it is the most flexible front-end for interactive 2-D and 3-D visualization for my purposes, coupled with R (a language for high performance statistical computing).

Visualization Libraries


free and open source software for statistical computing and graphics (with numerous visualization packages)

Shiny (a notable dynamic visualization extension for R)

ggplot2 (a notable static visualization extension for R)

RStudio (IDE for R)

StatET (Eclipse-based IDE for R)

scatter3d: (downloadable from within R IDE)

scatterplot3d: (downloadable from within R IDE)

cloudplot: (downloadable from within R IDE)

rgl:… is a 3D visualization system based on OpenGL. It provides a medium to high level interface for use in R, currently modelled on classic R graphics, with extensions to allow for interaction. An rgl device at its core is a real-time 3D engine written in C++. It provides an interactive viewpoint navigation facility (mouse + wheel support) and an R programming interface.


GUI for data mining using R


open source software (license) built on the Eclipse/RCP platform in order to scale to address a wide range of applications and to benefit from the workbench and advanced plugin system implemented in Eclipse

datamining / exploration workflows


the world-leading open-source system for data mining


very similar to OpenFrameworks, but written in Java rather than C++


matplotlib: /* outstanding */

Python (the author, regrettably, passed away in 2012). matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shell (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.


Javascript charting library for jQuery

Highcharts: /* outstanding */

Highcharts is a charting library written in pure HTML5/JavaScript, offering intuitive, interactive charts to your web site or web application. Highcharts currently supports line, spline, area, areaspline, column, bar, pie, scatter, angular gauges, arearange, areasplinerange, columnrange, bubble, box plot, error bars, funnel, waterfall and polar chart types.

Highstock: /* outstanding */ /* free for non-commercial use */

Highstock is solely based on native browser technologies and doesn’t require client side plugins like Flash or Java. Furthermore you don’t need to install anything on your server. No PHP or ASP.NET. Highstock needs only two JS files to run: The highstock.js core and either the jQuery, MooTools or Prototype framework. One of these frameworks is most likely already in use in your web page.



[plugins]:[draggable points]:


a library for making high-quality scientific graphics under Linux and Windows; a library for the fast data plotting and data processing of large data arrays; a library for working in window and console modes and for easy embedding into other programs; a library with large and growing set of graphics.

* 11 November 2013. New version (v.2.2) of MathGL is released. There are speeding up, new plot kinds and data handling functions, new plot styles, masks for bitmap output, wx-widget, Lua interface.

* Javascript interface was developed with support of $DATADVANCE company.

gnuplot interfaces in ANSI C:

gnuplot_i talks to a gnuplot process by means of POSIX pipes. This implies that the underlying operating system has the notion of processes and pipes, and advertises them in a POSIX fashion. Since Windows does not respect this standard, this module will not compile on it, unless you have a compiler that offers a popen call on that platform or simulates it.
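The pipe mechanism described here is not specific to C. As a rough sketch of the same pattern in Python (this is not gnuplot_i's API; `open_session`, `send_commands` and the stand-in child program are hypothetical names for this illustration), a parent process can write text commands to a child's standard input and read back whatever the child prints. A tiny Python child is used instead of gnuplot so the example runs without gnuplot installed.

```python
import subprocess
import sys

# Stand-in child program (an assumption for this sketch): it echoes every
# command it receives on stdin, prefixed with "got: ".
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print('got: ' + line.strip())\n"
)


def open_session(argv):
    """Start a child process with pipes attached to its stdin and stdout."""
    return subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )


def send_commands(session, commands):
    """Write all commands, close the pipe, and return the child's output lines."""
    output, _ = session.communicate("\n".join(commands) + "\n")
    return output.splitlines()


session = open_session([sys.executable, "-c", CHILD_CODE])
replies = send_commands(session, ["set title 'demo'", "plot sin(x)"])
print(replies)
```

With gnuplot on the PATH, the argument list passed to `open_session` could presumably be replaced by something like `["gnuplot", "-persist"]`, in which case the commands would be interpreted as plot instructions rather than echoed.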

[real time data streams]:


Kst is the fastest real-time large-dataset viewing and plotting tool available (you may be interested in some benchmarks) and has built-in data analysis functionality. Kst is very user-friendly (both the community and the program itself!). Kst contains many powerful built-in features and is expandable with plugins and extensions (see developer information in the “Ressources” section). Kst is licensed under the GPL, and is as such freely available for anyone. What’s more, as of 2.0.x it is available on all of the following platforms: Microsoft Windows, Linux, Mac OSX. Note that KDE libraries are an optional dependency (i.e. you can run Kst without KDE, but you get additional features when running on a platform with KDE). See the “Downloads” section for pre-compiled executables or the sources.

Gigasoft ProEssentials:

Visual Studio.Net, ActiveX, C++ MFC

FindGraph: chart fx:

a range of platforms, including c++, java, html5, com, silverlight


visual basic and c++, good free help file, Chart control for Windows Forms

Nevron: xygraph:

delphi (source code available)


very similar to Processing, but written in C++ rather than Java

ofxChart is a custom add-on for OpenFrameworks C++ library. It allows adding various 2d and 3d charts to your projects.

(*GUI controls: ofxUI)

(*GUI controls: ofxRemoteUI)


.NET, Java, ASP, COM,VB, PHP, Perl, Python,Ruby, ColdFusion, C++


c/c++ (includes CINT, c/c++ interpreter)


(noncommercial is free) C, Fortran 77 and Fortran 90/95. For some operating systems, the languages Perl, Python, Java and the C/C++ interpreter Ch are also supported


freeware, C, C++


nPlot – a minimalistic data analysis application (c?/c++?), GLib was used @ some point

NPlot charting library:

NPlot (formerly known as scpl) is a free charting library for .NET. It boasts an elegant and flexible API. NPlot includes controls for Windows.Forms, ASP.NET and a class for creating Bitmaps. A GTK# control is also available.


just an example of code using plotutils


java, javascript, a simple cross-mode GUI library for Processing.


java, processing, 2d 3d graphs


java, processing, 2d graphs (full interactive capabilities of Processing)


java, processing


provides a high-level, simple scenegraph for Processing, modeled on the API for the scenegraph and display list implemented by ActionScript 3. Nest is targeted toward developers familiar with AS3, who wish to organize on-screen objects in a display list hierarchy. As with the AS3 display list, Nest establishes parent-child relationships, applies parent transformations to children, and allows easy manipulation of on-screen objects through member variables such as x, y, rotation, and scale. In addition to the scenegraph, Nest also includes an event-based communication system (built on the Observer pattern as implemented by Java’s Observer interface), and some minimal UI components.
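To make the parent-child idea concrete, here is a minimal Python sketch of a display list in which a parent's position, rotation and scale are applied to its children. It is not Nest's API: the `DisplayObject` class and its method names are assumptions chosen for this illustration, and only the member names x, y, rotation and scale come from the description above.

```python
import math


class DisplayObject:
    """A node in a display list; its transform is relative to its parent."""

    def __init__(self, x=0.0, y=0.0, rotation=0.0, scale=1.0):
        self.x = x
        self.y = y
        self.rotation = rotation  # radians, counter-clockwise
        self.scale = scale
        self.parent = None
        self.children = []

    def add_child(self, child):
        """Attach `child` so it inherits this object's transform."""
        child.parent = self
        self.children.append(child)
        return child

    def to_global(self, px=0.0, py=0.0):
        """Map a point from this object's local space into root space."""
        c, s = math.cos(self.rotation), math.sin(self.rotation)
        gx = self.x + self.scale * (c * px - s * py)
        gy = self.y + self.scale * (s * px + c * py)
        if self.parent is None:
            return gx, gy
        # Walk up the hierarchy so every ancestor's transform is applied.
        return self.parent.to_global(gx, gy)


root = DisplayObject(x=100.0, y=50.0, scale=2.0)
child = root.add_child(DisplayObject(x=10.0, y=0.0))

before = child.to_global()
root.rotation = math.pi / 2  # rotating the parent moves the child too
after = child.to_global()
print(before, after)
```

The child never changes its own x or y; its on-screen position changes only because the parent's transform is applied on the way up the hierarchy.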


dashboard/visualization, connectable to any platform / database / text data import


delphi, c++ (both $$$ & free opensource)


TAChart, an open-source implementation similar to TeeChart, is bundled with the LCL of the Lazarus IDE (Free Pascal)

Orange: /* outstanding for multidimensional data */

python (visualization & datamining) (freeware opensource)


scientific computation and visualization environment,  BeanShell, Jython (the Python programming language), Groovy and JRuby (Ruby programming language). This brings more power and simplicity for scientific computation. The programming can also be done in native Java. Finally, symbolic calculations can be done using Matlab/Octave high-level interpreted language.


Professional Open-Source Software KNIME [naim] is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes), including those of the KNIME community and its extensive partner network.

RPy and RPy2:

rpy2 is a redesign and rewrite of rpy. It provides a low-level interface to R, a proposed high-level interface, including wrappers to graphical libraries, as well as R-like structures and functions.


an open source visualization program for exploring high-dimensional data. It provides highly dynamic and interactive graphics such as tours, as well as familiar graphics such as the scatterplot, barchart and parallel coordinates plots. Plots are interactive and linked with brushing and identification. GGobi is fully documented in the GGobi book: “Interactive and Dynamic Graphics for Data Analysis”.




Based on PyQwt (plotting widgets for PyQt4 graphical user interfaces) and on the scientific modules NumPy and SciPy, guiqwt is a Python library providing efficient 2D data-plotting features (curve/image visualization and related tools) for interactive computing and signal/image processing application development. Guiqwt plotting features are quite limited in terms of plot types compared to matplotlib. However the currently implemented plot types are much more efficient.

Enthought Tool Suite:

The Enthought Tool Suite (ETS) is a collection of components developed by Enthought and our partners, which we use every day to construct custom scientific applications. It includes a wide variety of components, including: an extensible application framework, application building blocks, 2-D and 3-D graphics libraries, scientific and math libraries, developer tools. The cornerstone on which these tools rest is the Traits package, which provides explicit type declarations in Python; its features include initialization, validation, delegation, notification, and visualization of typed attributes.
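The combination of type declaration, validation and notification mentioned for the Traits package can be approximated in plain Python with a descriptor. The sketch below is not the Traits API: `Typed`, `Sensor` and `observe` are hypothetical names used only to illustrate the idea of attributes that check their type on assignment and notify observers of changes.

```python
class Typed:
    """Descriptor that validates the type on assignment and notifies observers."""

    def __init__(self, kind, default):
        self.kind = kind
        self.default = default

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name, self.default)

    def __set__(self, obj, value):
        if not isinstance(value, self.kind):
            raise TypeError(
                f"{self.name} must be {self.kind.__name__}, "
                f"got {type(value).__name__}"
            )
        old = self.__get__(obj)
        obj.__dict__[self.name] = value
        # Notify every registered observer about the change.
        for callback in obj.__dict__.get("_observers", []):
            callback(self.name, old, value)


class Sensor:
    gain = Typed(float, 1.0)
    label = Typed(str, "")

    def __init__(self):
        self._observers = []

    def observe(self, callback):
        """Register callback(name, old, new) for attribute changes."""
        self._observers.append(callback)


sensor = Sensor()
changes = []
sensor.observe(lambda name, old, new: changes.append((name, old, new)))

sensor.gain = 2.5          # valid: stored, observers notified
try:
    sensor.gain = "high"   # invalid: rejected before anything is stored
    rejected = False
except TypeError:
    rejected = True

print(changes, rejected)
```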


Chaco is a Python plotting application toolkit that facilitates writing plotting applications at all levels of complexity, from simple scripts with hard-coded data to large plotting programs with complex data interrelationships and a multitude of interactive tools. While Chaco generates attractive static plots for publication and presentation, it also works well for interactive data visualization and exploration.


Mayavi seeks to provide easy and interactive visualization of 3-D data. It offers: (1) An (optional) rich user interface with dialogs to interact with all data and objects in the visualization. (2) A simple and clean scripting interface in Python, including one-liners, or an object-oriented programming interface. Mayavi integrates seamlessly with numpy and scipy for 3D plotting and can even be used in IPython interactively, similarly to Matplotlib. (3) The power of the VTK toolkit, harnessed through these interfaces, without forcing you to learn it. (4) Additionally Mayavi is a reusable tool that can be embedded in your applications in different ways or combined with the Envisage application-building framework to assemble domain-specific tools.


Macros are a quick way to customize and extend Canopy. They can help you to automate tasks which are frequent or complicated.

Qwt:

Qwt 6.1 might be usable in all environments where you find Qt. It is compatible with Qt4 ( >= 4.4 ) and Qt5. (Curve Plots, Scatter Plot, Spectrogram, Contour Plot, Histogram, Dials, Compasses, Knobs, Wheels, Sliders, Thermos)


wxChart is a control that allows you to create charts. At the moment the type of charts available are Bar, Bar 3D, Pie and Pie 3D. Other chart types will be added soon.

Anti-Grain Geometry(AGG): /*outstanding*//*latest update was in 2007, author might be busy, perhaps it’s time to move the project to github*/

Anti-Grain Geometry (AGG) is an Open Source, free of charge graphic library, written in industrially standard C++. The terms and conditions of use AGG are described on The License page. AGG doesn’t depend on any graphic API or technology. Basically, you can think of AGG as of a rendering engine that produces pixel images in memory from some vectorial data. But of course, AGG can do much more than that. The ideas and the philosophy of AGG are: Anti-Aliasing. Subpixel Accuracy. The highest possible quality. High performance. Platform independence and compatibility. Flexibility and extensibility. Lightweight design. Reliability and stability (including numerical stability).

wxArt2D :

WxArt2D is a library for 2D graphical programming. WxArt2D is built on top of the wxWidgets Library. It is built around a document View Framework, and has several graphical drawing context classes. You can display (multiple and different levels) views of a document filled with a hierarchy of graphical objects. Tools allow you to zoom, drag, edit etc. the objects on the view.

wxMathPlot: http://wxmathplot.sourceforge.net

a library to add 2D scientific plot functionality to wxWidgets. It lets you embed inside your program a window for plotting scientific, statistical or mathematical data, with additions like legend or coordinate display in overlay. Multi-platform: runs everywhere wxWidgets does.

[py plotting tools]:

… might need to use Boost.Python:, a good example of its use:

ComponentOne Chart:

ComponentOne Studio for WinRT XAML includes UI controls for data visualization, layout and input. Based on the ComponentOne Silverlight controls and designed to enhance the rich user experience.


The idea is to provide a pure ANSI/ISO C++ plot class (called PPlot). Of course no actual plotting can be done in C++. The connection to the graphical world (widgets) is done via an abstract class that you have to implement. The class is called Painter and asks you to implement things like:
* draw a line from (x1,y1) to (x2,y2)
* draw a text at position (x,y)
* calculate width of a text when drawn on screen
(The author implemented the Painter class in Qt (a nice C++ framework) and Zinc (an obscure API used in real time computing).)
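The Painter arrangement described above, where plotting logic only talks to an abstract drawing interface, can be sketched in a few lines. This is a Python illustration of the pattern, not PPlot's actual C++ interface: the method names, the `RecordingPainter` back end, its fixed character width and the `plot_polyline` function are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Painter(ABC):
    """Abstract drawing surface; each GUI toolkit supplies its own subclass."""

    @abstractmethod
    def draw_line(self, x1, y1, x2, y2):
        """Draw a line from (x1, y1) to (x2, y2)."""

    @abstractmethod
    def draw_text(self, x, y, text):
        """Draw `text` at position (x, y)."""

    @abstractmethod
    def text_width(self, text):
        """Return the width of `text` when drawn on screen."""


class RecordingPainter(Painter):
    """Toy back end that records drawing operations instead of rendering them."""

    CHAR_WIDTH = 6  # assumed fixed-width font, in pixels

    def __init__(self):
        self.ops = []

    def draw_line(self, x1, y1, x2, y2):
        self.ops.append(("line", x1, y1, x2, y2))

    def draw_text(self, x, y, text):
        self.ops.append(("text", x, y, text))

    def text_width(self, text):
        return self.CHAR_WIDTH * len(text)


def plot_polyline(painter, points, label):
    """Toolkit-independent plotting code: it only uses the Painter interface."""
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        painter.draw_line(x1, y1, x2, y2)
    # Right-align the label against the last point using the measured width.
    last_x, last_y = points[-1]
    painter.draw_text(last_x - painter.text_width(label), last_y, label)


painter = RecordingPainter()
plot_polyline(painter, [(0, 0), (10, 5), (20, 3)], "run")
print(painter.ops)
```

Swapping `RecordingPainter` for a subclass that calls a real toolkit would leave `plot_polyline` untouched, which is the point of the design.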


C++11 plotting library for console apps

TeeChart Pro:

.NET, Java, ActiveX / COM, PHP and Delphi VCL / FireMonkey controls for business, Real-time, Financial, Scientific and Mobile applications.


Graph-tool is an efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks). Contrary to most other python modules with similar functionality, the core data structures and algorithms are implemented in C++, making extensive use of template metaprogramming, based heavily on the Boost Graph Library. This confers it a level of performance which is comparable (both in memory usage and computation time) to that of a pure C/C++ library. (Many algorithms are implemented in parallel using OpenMP)


Graphviz – Graph Visualization Software. Drawing graphs since 1988


Cairo is a 2D graphics library with support for multiple output devices. Currently supported output targets include the X Window System (via both Xlib and XCB), Quartz, Win32, image buffers, PostScript, PDF, and SVG file output. Experimental backends include OpenGL, BeOS, OS/2, and DirectFB. Cairo is designed to produce consistent output on all output media while taking advantage of display hardware acceleration when available (e.g. through the X Render Extension). The cairo API provides operations similar to the drawing operators of PostScript and PDF. Operations in cairo include stroking and filling cubic Bézier splines, transforming and compositing translucent images, and antialiased text rendering. All drawing operations can be transformed by any affine transformation (scale, rotation, shear, etc.). Cairo is implemented as a library written in the C programming language, but bindings are available for several different programming languages. Cairo is free software and is available to be redistributed and/or modified under the terms of either the GNU Lesser General Public License (LGPL) version 2.1 or the Mozilla Public License (MPL) version 1.1 at your option.



Gadfly is a plotting and data visualization system written in Julia. It’s influenced heavily by Leland Wilkinson’s book The Grammar of Graphics and Hadley Wickham’s refinement of that grammar in ggplot2. It renders publication quality graphics to PNG, Postscript, PDF, SVG, and Javascript. The Javascript backend uses d3 to add interactivity like panning, zooming, and toggling.




Data-Driven Documents (D3):

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.


Stock market, commodity and technical analysis charting app based on the Qt toolkit. Extendible plugin system for quotes and indicators. Portfolio, back testing, chart objects and many more features included.

[ta-lib]:  http://ta-lib.org

[adobe flash online charting]:

Visualization Library:

C++ (site offline)



Simple Directmedia Layer (SDL):

Simple DirectMedia Layer is a cross-platform development library designed to provide low level access to audio, keyboard, mouse, joystick, and graphics hardware via OpenGL and Direct3D.

Development Libraries: Windows: (Visual C++ 32/64-bit) SDL2-devel-2.0.1-mingw.tar.gz (MinGW 32/64-bit), Linux.


Gosu is a 2D game development library for the Ruby and C++ programming languages, available for Mac OS X, Windows, and Linux.

Data Visualization References

[software list]:

[softpedia charting software list]:

[30 best tools for data viz’n]:

[blog dedicated to graphs & charts]:

Free Technical Analysis Libraries

TA-Lib : Technical Analysis Library:

Multi-Platform Tools for Market Analysis … TA-Lib is widely used by trading software developers who need to perform technical analysis of financial market data. It includes 200 indicators such as ADX, MACD, RSI, Stochastic, Bollinger Bands, etc.; candlestick pattern recognition; and an open-source API for C/C++, Java, Perl, Python and 100% Managed .NET. TA-Lib is a free open-source library, available under a BSD License allowing it to be integrated in your own open-source or commercial application.
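To give a feel for what such indicators compute, here is a small pure-Python sketch of a simple moving average and Bollinger-style bands. It does not use TA-Lib and does not claim to reproduce TA-Lib's exact conventions: the function names, the default period and band width, and the choice of population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def sma(values, period):
    """Simple moving average over each full window of `period` values."""
    return [
        mean(values[i - period + 1 : i + 1])
        for i in range(period - 1, len(values))
    ]


def bollinger_bands(values, period=3, width=2.0):
    """Return (lower, middle, upper) for each full window.

    The middle band is the window mean; the outer bands sit `width`
    population standard deviations away from it (an assumption here).
    """
    bands = []
    for i in range(period - 1, len(values)):
        window = values[i - period + 1 : i + 1]
        middle = mean(window)
        spread = width * pstdev(window)
        bands.append((middle - spread, middle, middle + spread))
    return bands


closes = [10.0, 11.0, 12.0, 11.0, 10.0]  # made-up closing prices

averages = sma(closes, 3)
bands = bollinger_bands(closes, period=3)

for lower, middle, upper in bands:
    print(f"{lower:.3f}  {middle:.3f}  {upper:.3f}")
```

Each output row corresponds to one full three-value window, so five closing prices yield three rows.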


Numerical Analysis Software

GNU Radio:

* uhd_fft – A very simple spectrum analyzer tool

* Extending GNU Radio in C++:

* GNU Radio Companion (GRC) is a graphical tool for creating signal flow graphs and generating flow-graph source code

Data Mining Resources


Weka 3: Data Mining Software in Java, Weka—Machine Learning Software in Java


spss alternative opensource

Scicos :

Scicos is a graphical dynamical system modeler and simulator developed in the Metalau project at INRIA, Paris-Rocquencourt center. With Scicos, users can create block diagrams to model and simulate the dynamics of hybrid dynamical systems and compile models into executable code. Scicos is used for signal processing, systems control, queuing systems, and to study physical and biological systems. New extensions allow generation of component based modeling of electrical and hydraulic circuits using the Modelica language.

[lists] /*outstanding!*/ //the author of the data in the previous link