pandas is a library for working with table-like data. Its central object is the DataFrame, a two-dimensional labeled structure whose columns can hold different types. It supports efficient numerical operations similar to NumPy's ndarray, with additional functionality such as labeled axes and grouping.
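As a minimal sketch of what working with a DataFrame looks like, the snippet below builds a small table and computes a per-group mean. The column names and values are invented for illustration only.

```python
import pandas as pd

# Hypothetical sample data, made up for this example.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima"],
        "temp_c": [3.5, 19.0, 5.5, 21.0],
    }
)

# Columns are addressed by label and may hold different dtypes.
# Group rows by city and average the temperature column.
mean_temp = df.groupby("city")["temp_c"].mean()
print(mean_temp)
```

The result is a pandas Series indexed by city, so `mean_temp["Oslo"]` looks up a single group's mean by label.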
Blaze extends the usability of NumPy and pandas to distributed and out-of-core computing. It provides an interface similar to that of the NumPy ndarray or pandas DataFrame, but maps these familiar interfaces onto a variety of other computational engines, such as Postgres or Spark.
scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
pattern is a web mining module for Python. It has tools for:
- Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
- Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
- Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
- Network Analysis: graph centrality and visualization
dragline is a distributed Python web crawling framework.
sqlitebrowser is a high-quality, visual, open source tool to create, design, and edit database files compatible with SQLite.
dtcleaner produces multi-target decision trees for the purpose of data cleaning. It detects erroneous tuples in a dataset based on a given set of conditional functional dependencies (CFDs), and builds a classification model to predict erroneous tuples so that the "cleaned" dataset satisfies the CFDs and is semantically correct.
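To make the idea of a CFD concrete, here is a small plain-Python sketch that flags tuples violating one. It is not dtcleaner's API or its decision-tree method. The helper `cfd_violations`, the sample rows, and the rule "when country is UK, zip determines city" are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return indices of rows violating the CFD.

    The CFD reads: among rows matching `condition`, rows that agree on
    the `lhs` attributes must also agree on the `rhs` attribute.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # Only rows matching the condition pattern are constrained.
        if all(row.get(k) == v for k, v in condition.items()):
            key = tuple(row[a] for a in lhs)
            groups[key].append(i)

    violating = []
    for indices in groups.values():
        # More than one distinct rhs value for the same lhs key is a conflict.
        if len({rows[i][rhs] for i in indices}) > 1:
            violating.extend(indices)
    return sorted(violating)


# Hypothetical data: rows 0 and 1 share a UK zip but disagree on city.
rows = [
    {"country": "UK", "zip": "EH1", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH1", "city": "Glasgow"},
    {"country": "UK", "zip": "G1", "city": "Glasgow"},
    {"country": "US", "zip": "EH1", "city": "Springfield"},
]

violations = cfd_violations(rows, {"country": "UK"}, ["zip"], "city")
print(violations)  # [0, 1]
```

The check only reports that rows 0 and 1 conflict. Deciding which of them is actually wrong is the part a learned model, such as the one dtcleaner builds, would have to handle.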