Friday, May 24, 2013

List of NLP libraries and frameworks



http://mallet.cs.umass.edu/
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

http://gate.ac.uk/
GATE has grown over the years to include a desktop client for developers, a workflow-based web application, a Java library, an architecture and a process.

http://opennlp.apache.org/
OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. OpenNLP also includes maximum entropy and perceptron-based machine learning.

http://nltk.org/
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
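
To make two of the tasks named above concrete (tokenization and sentence segmentation), here is a minimal Python sketch using only regular expressions from the standard library. It is not the API of any library listed here: the function names `split_sentences` and `tokenize` and the splitting rules are my own assumptions, and real toolkits use trained models that handle abbreviations, quotes and other edge cases.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "NLP is fun. Tokenizers split text into words!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

A rule like this breaks on input such as "Dr. Smith arrived.", which is one reason to reach for the libraries above instead of hand-written regular expressions.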

Sunday, February 19, 2012

Hadoop Tools Universe

Tools and stuff related to the Hadoop ecosystem

Overview Links
http://www.slideshare.net/joshwills/hadoop-and-machine-learning


Configuration Management
Puppet
http://hstack.org/hstack-automated-deployment-using-puppet/

Chef
http://blog.milford.io/2011/03/first-github-post-hadoop-chef-cookbook/


Coordination Service
ZooKeeper
http://zookeeper.apache.org/
http://www.quora.com/Why-is-Apache-ZooKeeper-used-along-with-Hadoop


Storage
Distributed schema-less storage
HDFS
Ceph

Append-only storage and metadata
Avro
RCFile
HCatalog

Mutable key-value storage and metadata
HBase


Integration
Tool access
FUSE
JDBC
ODBC

Data Ingestion
Flume
Sqoop


Data Prep/Feature Engineering
Languages/Environments
Pig Latin
HiveQL

Java/Scala APIs
Crunch (Cloudera)
Scoobi (NICTA)
Cascading (Concurrent)
Jaql (IBM)


Machine Learning
Apache Mahout
http://mahout.apache.org/

SystemML (IBM)

R-based Systems
Segue
RHIPE
RHadoop
Ricardo (IBM)



Sunday, November 20, 2011

Octave Tips, Links, Tutes, Libraries


Collected Tips and Links for GNU Octave


Tips and Links:
Change default prompt and clear on startup:
Edit "C:\Octave\3.2.4_gcc-4.4.0\share\octave\site\m\startup\octaverc" and add:
PS1(">> ")
clc
See also:
http://www.gnu.org/software/octave/doc/interpreter/Customizing-the-Prompt.html
http://www.gnu.org/software/octave/doc/interpreter/Startup-Files.html

Debugging:
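Calling the keyboard function inside a script or function pauses execution and returns control to the user at an interactive prompt, where variables can be inspected.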
See also:
http://www.gnu.org/software/octave/doc/interpreter/Debugging.html

Useful Commands:
whos - Variables in the current scope with sizes and bytes


Tutorials:
YAGTOM: Yet Another Guide TO Matlab


Libraries:
PMTK supports a large variety of probabilistic models, including linear and logistic regression models (optionally with kernels), SVMs and Gaussian processes, directed and undirected graphical models, various kinds of latent variable models (mixtures, PCA, HMMs), etc. Associated textbook.

Thursday, November 17, 2011

New online Stanford courses for 2012

Wow, new free online Stanford courses for Jan 2012!  Really enjoying the ML and AI classes offered in 2011 and have heard the DB class is good too.

EDIT: wow they keep coming for 2012...will keep updating...

EDIT 2: even better see: www.class-central.com
and www.online-education.za.org/all_available

Computer Science:
http://www.nlp-class.org  Natural Language Processing (23 Jan 2012)
http://www.pgm-class.org  Probabilistic Graphical Models (Jan 2012)
http://www.game-theory-class.org/  Game Theory (late Feb 2012)
http://www.hci-class.org/  Human-Computer Interaction (Jan 2012)
http://www.saas-class.org/  Software Engineering for Software as a Service
http://www.cs101-class.org/  Computer Science 101
http://jan2012.ml-class.org/  Machine learning
http://www.algo-class.org/  Design and Analysis of Algorithms I (23 Jan 2012)
http://www.crypto-class.org/  Cryptography

Entrepreneurship:
http://www.launchpad-class.org/  The Lean Launchpad
http://www.venture-class.org/  Technology Entrepreneurship

Electrical Engineering:
http://www.infotheory-class.org/  Information Theory (Mar 2012)

Complex Systems:
http://www.modelthinker-class.org/  Model Thinking (23 Jan 2012)

Civil Engineering:
http://www.greenbuilding-class.org/  Making Green Buildings

Medicine:
http://www.anatomy-class.org/  Anatomy

Thursday, October 20, 2011

JavaScript Libraries and Links

Web Client Libraries and Frameworks
jQuery
jQuery is a fast and concise JavaScript Library that simplifies HTML document traversing, event handling, animating, and Ajax interactions for rapid web development.

Underscore.js
Underscore is a utility-belt library for JavaScript that provides a lot of the functional programming support that you would expect in Prototype.js (or Ruby), but without extending any of the built-in JavaScript objects. It's the tie to go along with jQuery's tux.

Modernizr
Modernizr is your starting point for making the best websites and applications that work exactly right no matter what browser or device your visitors use.

Knockout.js
Simplify dynamic JavaScript UIs by applying the Model-View-View Model (MVVM) pattern.

Backbone.js
Backbone supplies structure to JavaScript-heavy applications by providing models with key-value binding and custom events, collections with a rich API of enumerable functions, views with declarative event handling, and connects it all to your existing application over a RESTful JSON interface.


JS and CSS Toolkits
Google Libraries API
The Google Libraries API is a content distribution network and loading architecture for the most popular, open-source JavaScript libraries.

CoffeeScript
CoffeeScript is a little language that compiles into JavaScript. Underneath all those awkward braces and semicolons, JavaScript has always had a gorgeous object model at its heart. CoffeeScript is an attempt to expose the good parts of JavaScript in a simple way. Also see: Dart

lesscss
LESS extends CSS with dynamic behavior such as variables, mixins, operations and functions. LESS runs on both the client-side (IE 6+, Webkit, Firefox) and server-side, with Node.js and Rhino.

twitter bootstrap
Simple and flexible HTML, CSS, and JavaScript for popular user interface components and interactions.

requirejs.org
JavaScript file and module loader. It is optimized for in-browser use, but it can be used in other JavaScript environments, like Rhino and Node. Using a modular script loader like RequireJS will improve the speed and quality of your code.


JS InfoVis Libraries
Processing.js
Processing.js is the sister project of the popular Processing visual programming language, designed for the web. Processing.js makes your data visualizations, digital art, interactive animations, educational graphs, video games, etc. work using web standards and without any plug-ins.

d3.js
D3.js is a small, free JavaScript library and visualization framework for manipulating documents based on data. (From the authors of the now-inactive Protovis.)

Flot
Flot is a pure JavaScript plotting library for jQuery. It produces graphical plots of arbitrary datasets on-the-fly client-side.

arborjs
Arbor is a graph visualization library built with web workers and jQuery. Rather than trying to be an all-encompassing framework, arbor provides an efficient, force-directed layout algorithm plus abstractions for graph organization and screen refresh handling.

Raphaël
Raphaël is a small JavaScript library that should simplify your work with vector graphics on the web. If you want to create your own specific chart or image crop and rotate widget, for example, you can achieve it simply and easily with this library.


JS Math Libraries
MathJax
MathJax is an open source JavaScript display engine for mathematics that works in all modern browsers.

jStat
jStat is a statistical library written in JavaScript that allows you to perform advanced statistical operations without the need of a dedicated statistical language (e.g. MATLAB or R).


On the Radar
Stripe.js
Stripe.js lets you build your own payment forms while still avoiding most PCI requirements.  Credit cards go directly to Stripe's secure environment, and never hit your servers.

three.js
JavaScript 3D Engine. The aim of the project is to create a lightweight 3D engine with a very low level of complexity.

speak.js
Enables text-to-speech on the web using only JavaScript and HTML5. A port of the eSpeak speech synthesizer from C++ to JavaScript using Emscripten. Online demo: http://syntensity.com/static/espeak.html


Blog Posts Etc
Knockout.js vs. Backbone.js
Introducing Knockout, a UI library for JavaScript
A re-introduction to JavaScript (mozilla.org)
JavaScript for C# developers: writing a library
20 Fresh JavaScript Data Visualization Libraries


Tuesday, September 13, 2011

R Packages

Here is a list of R packages that I find useful:

ggplot2 - A plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. (Sep 2011)

caret - Short for Classification And REgression Training. A set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, model tuning using resampling, and variable importance estimation. (Sep 2011)

randomForest - Classification and regression based on a forest of trees using random inputs. (Dec 2011)

ada - Performs discrete, real, and gentle boost under both exponential and logistic loss on a given data set. The package ada provides a straightforward, well-documented, and broad boosting routine for classification, ideally suited for small to moderate-sized data sets. (Dec 2011)

gbm - Generalized Boosted Regression Models.  This package implements extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. Includes regression methods for least squares, absolute loss, quantile regression, logistic, Poisson, Cox proportional hazards partial likelihood, and AdaBoost exponential loss. (Dec 2011)

dummies - Create dummy/indicator variables flexibly and efficiently.  Expands factors, characters and other eligible classes into dummy/indicator variables.

multicore - Works around R's default of running on a single CPU core. This package provides a way of running parallel computations in R on machines with multiple cores or CPUs. Jobs can share the entire initial workspace, and the package provides methods for collecting results.
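
Two of the ideas in the list above are easy to show in a few lines: expanding a factor into dummy/indicator variables (what dummies does) and splitting rows into folds for resampling (one of the data splitting steps caret automates). The sketch below is in Python rather than R and is not the API of either package. The names `make_dummies` and `k_fold_splits` are hypothetical helpers, written only to illustrate the technique.

```python
import random


def make_dummies(values):
    """Expand categorical values into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def k_fold_splits(n_rows, k, seed=0):
    """Yield (train, test) row-index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # deal rows round-robin into k folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


colors = ["red", "green", "red", "blue"]
print(make_dummies(colors))
# {'blue': [0, 0, 0, 1], 'green': [0, 1, 0, 0], 'red': [1, 0, 1, 0]}

for train, test in k_fold_splits(n_rows=6, k=3):
    print("train:", sorted(train), "test:", sorted(test))
```

Each row lands in exactly one test fold, so every observation is used for evaluation once and for training k-1 times.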

Friday, August 5, 2011

List of Data Mining / Forecasting Competitions

Kaggle
http://www.kaggle.com/Competitions

TunedIT Solutions
http://tunedit.org/challenges/

Causality Workbench (none current as of Aug 2011)
http://www.causality.inf.ethz.ch/home.php

DARPA's Shredder Challenge (document reconstruction)
Closes December 5, 2011
http://www.shredderchallenge.com/

1st International Competition of Time Series Forecasting
Closes 10th of January 2012
http://www.caos.inf.uc3m.es/~jperalta/ICTSF/

If you know of any more sites offering data mining / forecasting / machine learning competitions, please leave a comment! Thanks.