coding beacon

[programming & visualization]

Shiny: Extensions

Machine Learning Models of My Personal Interest (work in progress)

Convolutional Neural Networks

http://deeplearning.net/tutorial/lenet.html#lenet

Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs. From Hubel and Wiesel’s early work on the cat’s visual cortex [Hubel68], we know the visual cortex contains a complex arrangement of cells. These cells are sensitive to small sub-regions of the visual field, called receptive fields. The sub-regions are tiled to cover the entire visual field. These cells act as local filters over the input space and are well-suited to exploit the strong spatially local correlation present in natural images.
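To make the “local filter” idea concrete, here is a toy sketch of my own in base R (not part of the tutorial): a single 3×3 filter is slid over an image matrix, and each output value depends only on a small local patch of the input, which is essentially what one feature map of a convolutional layer computes before the nonlinearity.

    # Toy illustration of a convolutional "receptive field":
    # each output pixel is a weighted sum over a small local patch of the input.
    conv2d <- function(img, kernel) {
      kh <- nrow(kernel); kw <- ncol(kernel)
      out <- matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
      for (i in seq_len(nrow(out))) {
        for (j in seq_len(ncol(out))) {
          patch <- img[i:(i + kh - 1), j:(j + kw - 1)]   # the receptive field
          out[i, j] <- sum(patch * kernel)               # local weighted sum
        }
      }
      out
    }

    img    <- matrix(runif(64), 8, 8)                        # fake 8x8 "image"
    kernel <- matrix(c(-1, 0, 1, -2, 0, 2, -1, 0, 1), 3, 3)  # Sobel-style edge filter
    feature_map <- conv2d(img, kernel)                       # 6x6 feature map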

Additionally, two basic cell types have been identified: Simple cells respond maximally to specific edge-like patterns within their receptive field. Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.

The animal visual cortex being the most powerful visual processing system in existence, it seems natural to emulate its behavior. Hence, many neurally-inspired models can be found in the literature. To name a few: the NeoCognitron [Fukushima], HMAX [Serre07] and LeNet-5 [LeCun98], which will be the focus of this tutorial.

http://stackoverflow.com/questions/1313336/convolutional-neural-network-how-to-get-the-feature-maps?rq=1



Random Forests

Slides: http://www.cs.ubc.ca/~nando/540-2013/lectures.html

Andrew Ng: Deep Learning

Self-Taught Learning and Unsupervised Feature Learning


Choosing GPU Hardware for Neural Network Modeling (draft)

Hardware Interaction Model

CUDA

It is not yet clear to me whether specific NVidia hardware makes a lot of difference. More on this later. In the meantime, here’s a library that claims to need Fermi/Tesla or equivalent (“Fermi-generation GPU (GTX 4xx, GTX 5xx, or Tesla equivalent) required.”). What the reason is, I can only guess.

https://code.google.com/p/cuda-convnet/

OpenCL

http://deeplearning.net/software/theano/tutorial/using_gpu.html (mentioned in the second half of the page)

More on this later

GPU Use with Neural Networks (Other Sources)

http://deeplearning.net/software/theano/tutorial/using_gpu.html

Open Source Data Mining Software: Orange

The website (http://orange.biolab.si/) describes Orange as follows: “Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.”

I am currently evaluating front-ends for modeling project workflows, and visual programming is very appealing for this purpose. Orange’s environment looks as follows:


http://orange.biolab.si/screenshots/
Orange is already included in the list of data mining software I compiled back in 2013: https://wordpress.com/post/55540359/77.

Data Visualization cheatsheet, plus Spanish translations

RStudio Blog


We’ve added a new cheatsheet to our collection. Data Visualization with ggplot2 describes how to build a plot with ggplot2 and the grammar of graphics. You will find helpful reminders of how to use:

  • geoms
  • stats
  • scales
  • coordinate systems
  • facets
  • position adjustments
  • legends, and
  • themes

The cheatsheet also documents tips on zooming.

Download the cheatsheet here.

Bonus – Frans van Dunné of Innovate Online has provided Spanish translations of the Data Wrangling, R Markdown, Shiny, and Package Development cheatsheets. Download them at the bottom of the cheatsheet gallery.

View original post

An Attempt to Organize Datamining Resources

0. R Cheatsheets 

http://www.rstudio.com/resources/cheatsheets/


1. Choosing Visualization Tools

Three Golden Rules of Visualization Tools

Rule #1: No tool will turn you into a pro.
Rule #2: First learn one single tool very well.
Rule #3: Choose tools you are totally in love with.

Ggplot2

The main website by the author, Hadley Wickham: http://ggplot2.org/

* The ggplot2 package will give you the most return on the time you invest in learning how to use it (a short sketch follows below)
* A quick reference (cheatsheet) for ggplot2: “Data Visualization”
* A short intro/tutorial for ggplot2
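As a quick taste (a minimal sketch of my own using the built-in mtcars data), a ggplot2 plot is built by mapping variables to aesthetics and then adding layers:

    library(ggplot2)

    # Map variables to aesthetics, then add layers (the "grammar of graphics"):
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point(size = 3) +                    # layer 1: points
      geom_smooth(method = "lm", se = FALSE) +  # layer 2: a linear fit per group
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")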

Ggvis

The main website: http://ggvis.rstudio.com/

Ggvis is used “…more for data exploration than data presentation. …ggvis makes many more assumptions about what you’re trying to do: this allows it to be much more concise, at some cost of generality.”
* “ggvis provides a tree-like structure allowing properties and data to be specified once and inherited by children.” (see the sketch below)
* Ggvis vs Ggplot2
* Range selector for ggvis
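A minimal ggvis sketch (my own example, assuming the ggvis syntax current in 2015) showing how the data and properties declared once at the top are inherited by the layers below:

    library(ggvis)

    # Data and default properties are declared once; both layers inherit them:
    mtcars %>%
      ggvis(x = ~wt, y = ~mpg) %>%
      layer_points(fill := "steelblue") %>%  # := sets a property, ~ would map one
      layer_smooths()                        # rendered interactively in the Viewer/browser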


2. Choosing Tools for Interactivity

Shiny

The main website: http://shiny.rstudio.com/gallery/

Shiny simply turns your R session into a web server and lets you interact with your data through a browser. See the “Shiny” cheatsheet (also above).
Shiny is OK to start with; however, you might wish to extend it with widgets or whatever fits your needs best.
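For a sense of scale, a complete Shiny app can be this small (a minimal sketch, not tied to any particular project): a slider in the browser drives an R plot that is re-rendered on the server.

    library(shiny)

    # The smallest useful Shiny app: one input widget, one reactive output.
    ui <- fluidPage(
      sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
      plotOutput("hist")
    )

    server <- function(input, output) {
      output$hist <- renderPlot({
        hist(rnorm(input$n), main = paste("n =", input$n))
      })
    }

    shinyApp(ui, server)   # starts a local web server and serves the app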

Htmlwidgets

The main website: http://www.htmlwidgets.org/develop_intro.html

Pros (examples; a dygraphs sketch follows below):
https://rstudio.github.io/dygraphs/gallery-range-selector.html
https://christophergandrud.github.io/networkD3/
http://www.htmlwidgets.org/showcase_threejs.html
https://github.com/htmlwidgets/sparkline

Cons: for some widgets, large datasets may need to be shipped to the client.
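For example, the dygraphs range selector from the first link above boils down to a couple of lines (a sketch assuming the dygraphs package and the built-in nhtemp time series):

    library(dygraphs)

    # An interactive time-series widget with a draggable range selector;
    # the whole series is shipped to the browser, hence the caveat about large data.
    dygraph(nhtemp, main = "New Haven Temperatures") %>%
      dyRangeSelector(dateWindow = c("1920-01-01", "1960-01-01"))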


3. Building a Dashboard

Dashboard Theory

Stephen Few

Stephen’s Website
His book “Information Dashboard Design” on Amazon
Why Most Dashboards Fail (pdf)

Dashboards are Dumb (or how we sometimes delude ourselves with fancy dashboards)

The essence in one quote: “The key to usability is the association between appropriate controllers and the individual meters. In a car, the controllers are the steering wheel, the gas pedal, the brake pedal, the ignition switch, and the gearshift, primarily. Generally, there are one or two controllers associated with each meter and the action of each controller is usually proportional to the metric that appears on the meter (e.g. Gas pedal and brake pedal control speed; gas pedal and gear shift control RPM, etc.). There are more controllers on a plane, but the same relationships hold between controllers and meters, at least for older planes.”

Risk Communication Dashboards (pdf)

Nine User Interface Design Patterns

Ten Tips to Design User-Friendly Dashboards

Shiny and GoogleVis

http://www.r-bloggers.com/dashboards-in-r-with-shiny-and-googlevis/

Shinydashboard

http://glimmer.rstudio.com/reinholdsson/shiny-dashboard/
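The basic skeleton of a shinydashboard app looks like this (a minimal sketch, assuming the shinydashboard package; the header/sidebar/body split is the whole idea):

    library(shiny)
    library(shinydashboard)

    # Every shinydashboard app is built from these three pieces:
    ui <- dashboardPage(
      dashboardHeader(title = "Demo dashboard"),
      dashboardSidebar(sliderInput("bins", "Bins", min = 5, max = 50, value = 20)),
      dashboardBody(
        fluidRow(box(plotOutput("hist"), width = 12))
      )
    )

    server <- function(input, output) {
      output$hist <- renderPlot(hist(faithful$waiting, breaks = input$bins))
    }

    shinyApp(ui, server)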

Examples

Security Dashboards in Shiny

Dashboard Design Using MS Excel *

* In case you have to use Excel, have a look at “Sparklines for Excel” maintained by Fabrice Rimlinger: http://sparklines-excel.blogspot.com/


4. Managing Your Workflow

A workflow is used to automate repetitive operations you perform on the data. If you generate so much data that it turns into a hard-to-use pile, as it did in my case, you can plan ahead and have a look at the various tools that suit your needs. I am still a long way from organizing every aspect of the project into a coherent system, but my preliminary survey of the available software suggests that DAWN (see below) is the most flexible; however, it also requires the most programming skill. Other tools, such as RapidMiner or Weka, can be used with the R programming environment almost out of the box.

Rapid Miner (open source)

https://rapidminer.com/ (R is integrated via a standard plugin downloadable from within the software itself)

Dawn Science (open source)

Data Analysis WorkbeNch (DAWN) is an Eclipse-based workbench for doing scientific data analysis. It implements sophisticated support for the following:
(1) Visualization of data in 1D, 2D and 3D
(2) Python script development, debugging and execution
(3) Processing and Workflows for visual algorithms analyzing scientific data

http://www.dawnsci.org/ (use the source code & eclipse as the base)

Weka (open source)

http://www.cs.waikato.ac.nz/ml/weka/

How to integrate R into Weka: http://markahall.blogspot.ru/2012/07/r-integration-in-weka.html

Magrittr (R package)

http://cran.r-project.org/web/packages/magrittr/index.html (included as a dependency of the dplyr package)

This R package provides “forward-piping” operators, e.g. %>% (see the “Data Wrangling” cheatsheet above).
A quote from the package description: “The magrittr package offers a set of operators which promote semantics that will improve your code by structuring sequences of data operations left-to-right (as opposed to from the inside and out), avoiding nested function calls, minimizing the need for local variables and function definitions, and making it easy to add steps anywhere in the sequence of operations.”
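A small before/after illustration of my own of what %>% buys you:

    library(magrittr)

    # Nested, inside-out style:
    round(mean(sqrt(abs(rnorm(100)))), 2)

    # The same computation, read left to right with the forward pipe:
    rnorm(100) %>%
      abs() %>%
      sqrt() %>%
      mean() %>%
      round(2)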

Other Datamining Software (commercial and open source)

http://decisiontrees.net/decision-trees-and-data-mining-software/


5. Data Mining/Analytics Workflow Theory

Introduction to Data Mining


Understanding Data Analytics Project Life Cycle


6. Useful Quotes from R-Bloggers, Mostly

An Introduction to Statistical Learning with Applications in R (free pdf)

http://www-bcf.usc.edu/~gareth/ISL/
“This book provides an introduction to statistical learning methods. It is aimed for upper level undergraduate students, masters students and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations on how to implement the various methods in real life settings, and should be a valuable resource for a practicing data scientist.”

Elements of Statistical Learning (free pdf)

http://statweb.stanford.edu/~tibs/ElemStatLearn/download.html
“The go-to bible for this data scientist and many others is The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Each of the authors is an expert in machine learning / prediction, and in some cases invented the techniques we turn to today to make sense of big data: ensemble learning methods, penalized regression, additive models and nonparemetric smoothing, and much much more.”

Machine learning

In-depth introduction to machine learning — 15 hours of expert videos

Free Ebooks on Machine Learning

Why you should learn R first for data science

http://www.r-bloggers.com/why-you-should-learn-r-first-for-data-science/ (selected quotes below):

Data wrangling
“It’s often said that 80% of the work in data science is data manipulation. … R has some of the best data management tools you’ll find. The dplyr package in R makes data manipulation easy. … When you “chain” the basic dplyr together, you can dramatically simplify your data manipulation workflow.”
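A typical chained pipeline looks like this (a minimal sketch using the built-in mtcars data):

    library(dplyr)

    # Filter, derive a column, aggregate per group, sort: one readable chain.
    mtcars %>%
      filter(hp > 100) %>%
      mutate(kml = mpg * 0.425) %>%             # miles per gallon -> km per litre
      group_by(cyl) %>%
      summarise(n = n(), mean_kml = mean(kml)) %>%
      arrange(desc(mean_kml))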

Data visualization
“ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization. … there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.

Moreover, when you combine ggplot2 and dplyr together (using the chaining methodology), finding insight in your data becomes almost effortless.”

Machine learning
“While … most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools … [and] R has some of the best tools and resources.

One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.”

Data Sources

Quandl — free & premium financial market data (think “free Bloomberg in the format you want”)

Over 70 free large data repositories (updated) — a broad range of data (including finance related)

FDF Financial Data Finder

Datasets for Data Mining and Data Science at KDnuggets

Quant Finance Resources at CalTech

Ideas, Bells, and Whistles

Working with Time Series
Graphing Highly Skewed Data
In 4 Steps your Application (including R) is running on a Cloud Computing Cluster
Eight New Ideas From Data Visualization Experts
Hierarchical Clustering with R (featuring D3.js and Shiny)
A Growing List of 20+ Free Ebooks on Datamining
Big Data Made Simple: Feed on Visualization
My collection of visualization and datamining software and libraries


7. Where to Ask for Help

General R questions

#R channel at Freenode (IRC network) — perhaps the fastest way to get help with R
StackOverflow

Shiny

Shiny at rstudio.com
Shiny Google Group


Databases / formats for timeseries data

Databases

MongoDB (NoSQL)

SciDB

HDF (HDF5)

Why HDF? http://www.hdfgroup.org/why_hdf/
https://en.wikipedia.org/wiki/Hierarchical_Data_Format
http://www.inside-r.org/packages/cran/hdf5/docs/hdf5
http://www.space-research.org/
users: http://www.hdfgroup.org/HDF5/users5.html
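As a quick sketch of what working with HDF5 from R looks like, here is an example using the Bioconductor rhdf5 package (note: the CRAN link above points to the older hdf5 package; using rhdf5 instead is my own assumption):

    library(rhdf5)   # Bioconductor package

    # Write a matrix of "time series" into a hierarchical file, then read it back:
    h5createFile("series.h5")
    h5createGroup("series.h5", "prices")
    h5write(matrix(rnorm(1e4), ncol = 10), "series.h5", "prices/daily")

    daily <- h5read("series.h5", "prices/daily")
    h5ls("series.h5")   # inspect the hierarchical layout of the file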

Interfaces

http://blog.revolutionanalytics.com/2014/01/fast-and-easy-data-munging-with-dplyr.html

Database Management Software

Tadpole (https://github.com/hangum/TadpoleForDBTools/wiki): Tadpole DB Hub is a unified, web-based tool for managing Apache Hive, Amazon RDS, CUBRID, MariaDB, MySQL, Oracle, SQLite, MSSQL, PostgreSQL, and MongoDB databases (https://marketplace.eclipse.org/content/tadpole-db-hub).

Toad (http://www.toadworld.com/m/freeware): sort of like a tadpole, just bigger and fatter.

DBeaver (http://dbeaver.jkiss.org/): a free multi-platform database tool for developers, SQL programmers, database administrators, and analysts. It supports all popular databases: MySQL, PostgreSQL, SQLite, Oracle, DB2, SQL Server, Sybase, MongoDB, etc.

How to start using OpenCL ASAP

The following is written assuming the computer has an Intel CPU:

Supported Targets

3rd Generation Intel Core Processors
Intel “Bay Trail” platforms with Intel HD Graphics
4th Generation Intel Core Processors (currently need a kernel patch; see the “Known Issues” section on the Beignet page).
5th Generation Intel Core Processors “Broadwell”.

To start programming right away, do the following:

1. Get Beignet. https://wiki.freedesktop.org/www/Software/Beignet/

Beignet is an open source implementation of the OpenCL specification – a generic compute oriented API. This code base contains the code to run OpenCL programs on Intel GPUs; it defines and implements the OpenCL host functions required to initialize the device, create the command queues, the kernels and the programs, and run them on the GPU.

In terms of the OpenCL 1.2 spec, Beignet is quite complete now (at the time of writing, 28/03/2015).

2. Get OpenCL Studio http://opencldev.com/


The OpenCL Programming Book

http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/

Eclipse: prepare for OpenCL programming

http://stackoverflow.com/questions/21318112/how-to-prepare-eclipse-for-opencl-programming-intel-opencl-sdk-installed-in-li

http://marketplace.eclipse.org/content/opencl-development-tool

dplyr resources

A must-watch dplyr package video:

Fast and easy data munging, with dplyr

http://blog.revolutionanalytics.com/2014/01/fast-and-easy-data-munging-with-dplyr.html

High Performance Computing Libraries (to be updated)

1. IRC channel #opencl at freenode network

2. https://blog.ajguillon.com/

3. Reference on installing prerequisites (hardware drivers)

http://cran.r-project.org/web/packages/OpenCL/INSTALL
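Once the drivers are in place, the CRAN OpenCL package lets you compile and run kernels directly from R. The sketch below follows the pattern of the package's own examples as I remember them for the 2015-era version, so treat the exact function signatures (oclPlatforms, oclDevices, oclSimpleKernel, oclRun) as an assumption to be checked against the package documentation:

    library(OpenCL)

    p <- oclPlatforms()        # available OpenCL platforms
    d <- oclDevices(p[[1]])    # devices on the first platform

    # A trivial kernel: multiply every element of a vector by 2.
    src <- "
    __kernel void times_two(__global float* output,
                            const unsigned int count,
                            __global float* input) {
      int i = get_global_id(0);
      if (i < count) output[i] = 2.0f * input[i];
    }"

    k <- oclSimpleKernel(d[[1]], "times_two", src, "single")
    oclRun(k, 10, as.numeric(1:10))   # 2, 4, ..., 20 computed on the device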

CUDA vs OpenCL

https://www.wikivs.com/wiki/CUDA_vs_OpenCL

http://streamcomputing.eu/blog/2010-01-28/opencl-the-battle-part-i/

http://programmers.stackexchange.com/questions/53410/cuda-vs-opencl-opinions#53699

http://blog.accelereyes.com/blog/2012/02/17/opencl_vs_cuda_webinar_recap/

ArrayFire (open source) http://arrayfire.com/

“ArrayFire supports both CUDA-capable NVIDIA GPUs and most OpenCL devices, including AMD GPUs/APUs and Intel Xeon Phi co-processors. It also supports mobile OpenCL devices from ARM, Qualcomm, and others. We want your code to run as fast as possible, regardless of the hardware.”

“ArrayFire is a blazing fast software library for GPU computing. Its easy-to-use API and array-based function set make GPU programming simple. A few lines of code in ArrayFire can replace dozens of lines of raw GPU code, saving you valuable time and lowering development costs.”

Getting Started

(written before ArrayFire became open source): http://blog.accelereyes.com/blog/2013/03/04/arrayfire-examples-part-1-of-8-getting-started/

Overview by NVidia: http://devblogs.nvidia.com/parallelforall/arrayfire-portable-open-source-accelerated-computing-library/

* Download here

http://arrayfire.com/download/ & install, then append “%AF_PATH%/lib;” to your PATH environment variable

* Sources

https://github.com/arrayfire/arrayfire

* Files required to use ArrayFire from R (prerequisites: source files above)

https://github.com/arrayfire/arrayfire_r