orolo - data manipulation and gathering

Praxis

Technology

Pandas

pandas is a library for working with table-like data. Its central object is the DataFrame, a two-dimensional labeled structure whose columns can hold different types. It is built on NumPy's ndarray, so numerical operations stay efficient, and it adds indexing, grouping, joining, and missing-data handling on top.
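As a rough illustration of the DataFrame workflow described above, here is a minimal sketch. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: one row per (city, year) observation.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "year": [2020, 2021, 2020, 2021],
    "visitors": [120, 150, 80, 95],
})

# Vectorized arithmetic applies to the whole column at once.
df["visitors_k"] = df["visitors"] / 1000

# Group rows by a label column and aggregate a numeric column.
totals = df.groupby("city")["visitors"].sum()
print(totals)
```

Here `totals` is a pandas Series indexed by city, so a single value can be looked up with `totals["Oslo"]`.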

Blaze

Blaze extends the usability of NumPy and Pandas to distributed and out-of-core computing. Blaze provides an interface similar to that of the NumPy ND-Array or Pandas DataFrame but maps these familiar interfaces onto a variety of other computational engines like Postgres or Spark.

scrapy

scrapy is an open source, collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.

pattern

pattern is a web mining module for Python. It has tools for:

  • Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
  • Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
  • Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
  • Network Analysis: graph centrality and visualization.

dragline

dragline is a distributed Python web crawling framework.

sqlitebrowser

sqlitebrowser is a high-quality, visual, open source tool to create, design, and edit database files compatible with SQLite.
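Any SQLite-compatible file can be opened in sqlitebrowser, including one produced by a script. The sketch below uses Python's standard `sqlite3` module to create such a file. The function name `build_demo_db`, the `pages` table, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3


def build_demo_db(path):
    """Create a small SQLite database at `path` and return its row count."""
    conn = sqlite3.connect(path)
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception is raised.
        with conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS pages ("
                "id INTEGER PRIMARY KEY, "
                "url TEXT NOT NULL UNIQUE, "
                "status INTEGER)"
            )
            # INSERT OR IGNORE keeps the script safe to re-run on the same file.
            conn.executemany(
                "INSERT OR IGNORE INTO pages (url, status) VALUES (?, ?)",
                [
                    ("https://example.com/", 200),
                    ("https://example.com/missing", 404),
                ],
            )
        return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    # Writes demo.sqlite to the current directory; open it in sqlitebrowser.
    print(build_demo_db("demo.sqlite"))
```

Passing `":memory:"` instead of a file name builds the same database in memory, which is convenient for quick checks.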

dtcleaner

dtcleaner produces multi-target decision trees for the purpose of data cleaning. It detects erroneous tuples in a dataset based on a given set of conditional functional dependencies (CFDs), then builds a classification model to predict the erroneous tuples, so that the "cleaned" dataset satisfies the CFDs and is semantically correct.
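To make the idea of a CFD violation concrete, here is a minimal pure-Python sketch. It is not dtcleaner's API and does not build a decision tree; it only shows the detection step under one simple reading of a CFD: among rows matching a condition, equal values of the left-hand attributes must imply an equal right-hand value. The function name `cfd_violations` and the sample rows are hypothetical.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return sorted indices of rows that violate a simple CFD.

    The CFD reads: for rows where every key in `condition` has the given
    value, rows agreeing on all `lhs` attributes must agree on `rhs`.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # Only rows matching the condition are subject to the dependency.
        if all(row.get(k) == v for k, v in condition.items()):
            groups[tuple(row[a] for a in lhs)].append(i)

    bad = []
    for idxs in groups.values():
        # More than one distinct rhs value within a group is a violation.
        if len({rows[i][rhs] for i in idxs}) > 1:
            bad.extend(idxs)
    return sorted(bad)


# Hypothetical data: within the UK, a zip code should determine the city.
rows = [
    {"country": "UK", "zip": "EH1", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH1", "city": "London"},
    {"country": "UK", "zip": "SW1", "city": "London"},
    {"country": "US", "zip": "EH1", "city": "Springfield"},
]
violations = cfd_violations(rows, {"country": "UK"}, ["zip"], "city")
print(violations)
```

The check flags both conflicting rows because the dependency alone cannot say which one is wrong; deciding that is the part a tool like dtcleaner hands to a learned model.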