nail - Lightning-Fast Data Analysis CLI

nail is a high-performance command-line tool for analyzing, transforming, and exploring data files at blazing speed. Built with Rust, Apache Arrow, and DataFusion, it handles Parquet, CSV, JSON, and Excel files with ease—perfect for data engineers, analysts, and scientists who need quick insights without loading heavy tools.

🚀 Why nail? Process gigabyte-scale datasets in seconds • SQL-powered operations • Zero configuration • Works offline • Single binary

Features

  • Fast operations on large datasets using Apache Arrow and DataFusion
  • Multiple file formats supported: Parquet, CSV, JSON, and Excel
  • Comprehensive data operations: inspection, statistics, filtering, sampling, transformations
  • Data quality tools: search, deduplication, size analysis, missing value handling
  • Advanced features: joins, unions, schema manipulation, stratified sampling
  • File optimization: compression, sorting, and encoding for better performance
  • Data analysis tools: binning, pivot tables, correlation analysis
  • Flexible output: console display or file output in multiple formats
  • Production-ready with robust error handling and verbose logging

System Requirements

  • Operating System: Linux (Ubuntu 24.04+ recommended), macOS, Windows
  • Memory: 4GB+ RAM (8GB+ recommended for large datasets)
  • Storage: SSD recommended for large file operations
  • Dependencies:
    • macOS: none
    • Linux: pkg-config and OpenSSL (package names vary by distribution; see the example below)
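
A hedged example for Debian/Ubuntu-based systems (the package names below are an assumption for apt-based distributions; check your distribution's repositories):

# Assumed Debian/Ubuntu package names; adjust for your distro
sudo apt-get update && sudo apt-get install pkg-config libssl-dev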

Use Cases

nail is the perfect tool for:

  • Parquet file viewer - Quick inspection and exploration without Spark or Pandas
  • CSV to Parquet converter - Fast format conversion with automatic compression
  • Excel data analysis - Analyze .xlsx files without Excel or LibreOffice
  • Command-line data science - Perfect for SSH environments and automation scripts
  • Offline data tool - No cloud, no server dependencies - works completely offline
  • ETL pipelines - Transform and validate data in shell scripts and CI/CD
  • Data quality checks - Validate schemas, detect outliers, find duplicates quickly
  • Quick statistics - Get descriptive stats and correlations in seconds
  • Large file processing - Handle gigabyte-scale datasets that crash spreadsheet tools
  • Pandas alternative - 10x faster for read-only analysis tasks on large files

Installation

cargo install nail-parquet

or

# From source
git clone https://github.com/Vitruves/nail-parquet
cd nail-parquet
cargo build --release
sudo cp target/release/nail /usr/local/bin/

# Verify installation
nail --help

Global Options

All commands support these global flags:

  • -v, --verbose - Enable verbose output with timing and progress information
  • -j, --jobs N - Number of parallel jobs (default: half of available CPU cores)
  • -o, --output FILE - Output file path (prints to console if not specified)
  • -f, --format FORMAT - Output format: json, csv, parquet, text (auto-detect by default)
  • -h, --help - Display command help
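
These flags combine with any subcommand. For example, using only the flags listed above:

# Compute basic statistics with 8 parallel jobs and verbose progress,
# writing the result to a JSON file
nail stats data.parquet -j 8 --verbose -f json -o stats.json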

Commands

Data Inspection

nail describe

Show a comprehensive file overview with metadata, dimensions, column types, and data quality metrics.

# Display file overview with colored output
nail describe data.parquet

# Include verbose logging
nail describe data.parquet --verbose

Output includes:

  • File metadata (path, format, size, timestamps)
  • Dimensions (rows, columns, estimated memory)
  • Column type distribution (numeric, string, date/time, boolean)
  • Data quality metrics (density, null values, duplicates)
  • Storage efficiency
  • Column name listings by type

nail head

Display the first N rows of a dataset.

# Basic usage
nail head data.parquet

# Display first 10 rows
nail head data.parquet -n 10

# Save to JSON file
nail head data.parquet -n 5 -o sample.json

# Verbose output with timing
nail head data.parquet -n 3 --verbose

Options:

  • -n, --number N - Number of rows to display (default: 5)

nail tail

Display the last N rows of a dataset.

# Display last 5 rows
nail tail data.parquet

# Display last 20 rows with verbose logging
nail tail data.parquet -n 20 --verbose

# Save last 10 rows to CSV
nail tail data.parquet -n 10 -o tail_sample.csv

Options:

  • -n, --number N - Number of rows to display (default: 5)

nail preview

Randomly sample and display N rows from the dataset. Supports both static display and interactive browsing mode.

# Random preview of 5 rows
nail preview data.parquet

# Reproducible random sample with seed
nail preview data.parquet -n 10 --random 42

# Interactive mode for browsing records one by one
nail preview data.parquet --interactive

Options:

  • -n, --number N - Number of rows to display (default: 5)
  • --random SEED - Random seed for reproducible results
  • -I, --interactive - Interactive mode with scrolling (use arrow keys, q to quit)

Interactive Mode Controls:

  • ←/→ or h/l - Navigate between records (previous/next)
  • ↑/↓ or k/j - Navigate between fields within a record
  • q, Esc - Quit interactive mode
  • Ctrl+C - Force quit

nail headers

List column names, optionally filtered by regex patterns.

# List all column headers
nail headers data.parquet

# Filter headers with regex
nail headers data.parquet --filter "^price.*"

# Save headers to file
nail headers data.parquet -o columns.txt

# JSON format output
nail headers data.parquet -f json

Options:

  • --filter REGEX - Filter headers with regex pattern

nail schema

Display detailed schema information including column types and nullability.

# Display schema
nail schema data.parquet

# Save schema to JSON
nail schema data.parquet -o schema.json

# Verbose schema analysis
nail schema data.parquet --verbose

nail metadata

Display detailed Parquet file metadata, including schema, row groups, column chunks, compression, encoding, and statistics.

# Display basic metadata
nail metadata data.parquet

# Show all available metadata
nail metadata data.parquet --all

# Show detailed schema and row group information
nail metadata data.parquet --schema --row-groups --detailed

# Save all metadata to JSON
nail metadata data.parquet --all -o metadata.json

Options:

  • --schema - Show detailed schema information
  • --row-groups - Show row group information
  • --column-chunks - Show column chunk information
  • --compression - Show compression information
  • --encoding - Show encoding information
  • --statistics - Show statistics information
  • --all - Show all available metadata
  • --detailed - Show metadata in detailed format

nail size

Analyze file and memory usage with detailed size breakdowns.

# Basic size analysis
nail size data.parquet

# Show per-column size breakdown
nail size data.parquet --columns

# Show per-row analysis
nail size data.parquet --rows

# Show all size metrics
nail size data.parquet --columns --rows

# Raw bits output (no human-friendly formatting)
nail size data.parquet --bits

# Save size analysis to file
nail size data.parquet --columns --rows -o size_report.txt

Options:

  • --columns - Show per-column sizes
  • --rows - Show per-row analysis
  • --bits - Show raw bits without human-friendly conversion

nail search

Search for specific values across columns with flexible matching options.

# Basic search across all columns
nail search data.parquet --value "John"

# Search in specific columns
nail search data.parquet --value "error" -c "status,message,log"

# Case-insensitive search
nail search data.parquet --value "ACTIVE" --ignore-case

# Exact match only (no partial matches)
nail search data.parquet --value "complete" --exact

# Return row numbers only
nail search data.parquet --value "Bob" --rows

# Save search results
nail search data.parquet --value "error" -o search_results.json

Options:

  • --value VALUE - Value to search for (required)
  • -c, --columns PATTERN - Comma-separated column names to search in
  • --rows - Return matching row numbers only
  • --ignore-case - Case-insensitive search
  • --exact - Exact match only (no partial matches)

Data Quality Tools

nail outliers

Detect and optionally remove outliers from numeric columns using various methods.

# Detect outliers using IQR method for specific column
nail outliers data.parquet -c "price" --method iqr

# Detect outliers using Z-score with a custom threshold, showing values
nail outliers data.parquet -c "revenue" --method z-score --z-score-threshold 2.5 --show-values

# Remove outliers using Modified Z-score method from multiple columns
nail outliers data.parquet -c "age,income" --method modified-z-score --remove -o cleaned_data.parquet

# Detect outliers using Isolation Forest method (simplified)
nail outliers data.parquet -c "score" --method isolation-forest

Options:

  • -c, --columns PATTERN - Comma-separated column names or regex patterns for outlier detection
  • --method METHOD - Outlier detection method: iqr, z-score, modified-z-score, isolation-forest (default: iqr)
  • --iqr-multiplier VALUE - IQR multiplier for outlier detection (default: 1.5)
  • --z-score-threshold VALUE - Z-score threshold for outlier detection (default: 3.0)
  • --show-values - Show outlier values instead of just flagging them
  • --include-row-numbers - Include row numbers in output
  • --remove - Remove outliers from dataset and save cleaned data

Statistics & Analysis

nail stats

Compute statistical summaries for numeric and categorical columns with flexible percentile calculations and type filtering.

# Basic statistics (mean, Q25, Q50, Q75, unique count)
nail stats data.parquet

# Exhaustive statistics
nail stats data.parquet --stats-type exhaustive

# Statistics for specific columns
nail stats data.parquet -c "price,volume,quantity"

# Statistics with regex column selection
nail stats data.parquet -c "^(price|vol).*" --stats-type exhaustive

# Custom percentiles
nail stats data.parquet -c "revenue" --percentiles "0.1,0.5,0.9,0.95,0.99"

# Numeric columns only
nail stats data.parquet --numeric-only

# Categorical columns only
nail stats data.parquet --categorical-only

# Save statistics to file
nail stats data.parquet --stats-type basic -o stats.json

Options:

  • -c, --columns PATTERN - Comma-separated column names or regex patterns
  • -t, --stats-type TYPE - Statistics type: basic, exhaustive, hypothesis (default: basic)
  • -p, --percentiles VALUES - Custom percentiles (comma-separated, e.g., '0.1,0.5,0.9')
  • --numeric-only - Include only numeric columns
  • --categorical-only - Include only categorical (string) columns

Statistics Types:

  • basic: count, mean, quartiles (q25, q50, q75), and number of unique values.
  • exhaustive: count, mean, std dev, min, max, variance, duplicates, and unique values.
  • hypothesis: statistical significance tests (not yet implemented).

nail correlations

Compute correlation matrices between numeric columns with optional statistical significance testing.

# Basic Pearson correlation
nail correlations data.parquet

# Specific correlation types
nail correlations data.parquet --type kendall
nail correlations data.parquet --type spearman

# Correlations for specific columns
nail correlations data.parquet -c "price,volume,quantity"

# Output as correlation matrix format
nail correlations data.parquet --matrix

# Include statistical significance tests (p-values for fisher, t-test, chi-sqr)
nail correlations data.parquet --tests fisher_exact,t_test

# Comprehensive correlation analysis with significance tests
nail correlations data.parquet --tests fisher_exact -o correlations.json

Options:

  • -c, --columns PATTERN - Comma-separated column names or regex patterns
  • -t, --type TYPE - Correlation type: pearson, kendall, spearman (default: pearson)
  • --matrix - Output as correlation matrix format
  • --tests - Include statistical significance tests (fisher_exact, chi_sqr, t_test)
  • --digits N - Number of decimal places for correlation values (default: 4)

nail frequency

Compute frequency tables for categorical columns showing value counts, distributions, and percentages.

# Basic frequency table for a single column
nail frequency data.parquet -c "category"

# Multiple columns frequency analysis
nail frequency data.parquet -c "category,status,region"

# Save frequency table to file
nail frequency data.parquet -c "product_type" -o frequency_table.csv

# Verbose output with progress information
nail frequency data.parquet -c "category,status" --verbose

Options:

  • -c, --columns PATTERN - Comma-separated column names to analyze (required).

Output: Shows frequency counts with percentages for each value, helping identify data distribution patterns.

Data Manipulation

nail select

Select specific columns and/or rows from the dataset.

# Select specific columns
nail select data.parquet -c "id,name,price"

# Select columns with regex
nail select data.parquet -c "^(id|price).*"

# Select specific rows
nail select data.parquet -r "1,5,10-20"

# Select both columns and rows
nail select data.parquet -c "id,price" -r "1-100"

# Save selection to new file
nail select data.parquet -c "id,name" -o subset.parquet

Options:

  • -c, --columns PATTERN - Column names or regex patterns (comma-separated)
  • -r, --rows SPEC - Row numbers or ranges (e.g., "1,3,5-10")

nail drop

Remove specific columns and/or rows from the dataset.

# Drop specific columns
nail drop data.parquet -c "temp_col,debug_info"

# Drop columns matching pattern
nail drop data.parquet -c "^temp_.*"

# Drop specific rows
nail drop data.parquet -r "1,5,100-200"

# Drop both columns and rows
nail drop data.parquet -c "temp_col" -r "1-10"

Options:

  • -c, --columns PATTERN - Column names or regex patterns to drop
  • -r, --rows SPEC - Row numbers or ranges to drop

nail filter

Filter data based on column conditions or row characteristics.

# Filter by column conditions
nail filter data.parquet -c "price>100,volume<1000"

# Multiple conditions with different operators
nail filter data.parquet -c "age>=18,status=active,score!=0"

# String matching and numeric comparisons
nail filter data.parquet -c "name!=test,salary<=50000,active=true"

# Filter to numeric columns only
nail filter data.parquet --rows numeric-only

# Remove rows with NaN values
nail filter data.parquet --rows no-nan

# Remove rows with zeros
nail filter data.parquet --rows no-zeros

# String columns only
nail filter data.parquet --rows char-only

Options:

  • -c, --columns CONDITIONS - Column filter conditions (comma-separated). Supported operators:
    • = (equals), != (not equals)
    • > (greater than), >= (greater or equal)
    • < (less than), <= (less or equal)
    • Examples: age>25, status=active, price<=100
  • --rows FILTER - Row filter type: no-nan, numeric-only, char-only, no-zeros

nail fill

Fill missing values using various strategies.

# Fill with specific value
nail fill data.parquet --method value --value 0

# Fill specific columns with value
nail fill data.parquet -c "price,quantity" --method value --value -1

# Fill with mean (for numeric columns)
nail fill data.parquet --method mean

# Fill with median
nail fill data.parquet --method median -c "price,volume"

Options:

  • --method METHOD - Fill method: value, mean, median, mode, forward, backward (default: value)
  • --value VALUE - Fill value (required for 'value' method)
  • -c, --columns PATTERN - Comma-separated column names to fill

nail rename

Rename one or more columns.

# Rename a single column
nail rename data.parquet --column "old_name=new_name" -o renamed.parquet

# Rename multiple columns
nail rename data.parquet --column "id=user_id,val=value" -o renamed.parquet

Options:

  • -c, --column SPECS - Column rename specs (before=after), comma-separated.

nail create

Create new columns with expressions based on existing columns.

# Create a single new column
nail create data.parquet --column "total=price*quantity" -o enhanced.parquet

# Create multiple columns
nail create data.parquet --column "total=price*quantity,margin=(price-cost)" -o enhanced.parquet

# Filter rows while creating columns
nail create data.parquet --column "category_score=score*2" --row-filter "score>50" -o filtered_enhanced.parquet

Options:

  • -c, --column SPECS - Column creation specifications (name=expression), comma-separated.
  • -r, --row-filter FILTER - Row filter expression to apply before creating columns.

nail dedup

Remove duplicate rows or columns from the dataset.

# Remove duplicate rows (all columns considered)
nail dedup data.parquet --row-wise

# Remove duplicate rows based on specific columns
nail dedup data.parquet --row-wise -c "id,email"

# Keep last occurrence instead of first
nail dedup data.parquet --row-wise --keep last

# Remove duplicate columns (by name)
nail dedup data.parquet --col-wise

# Save deduplicated data
nail dedup data.parquet --row-wise -o clean_data.parquet

Options:

  • --row-wise - Remove duplicate rows (conflicts with --col-wise)
  • --col-wise - Remove duplicate columns (conflicts with --row-wise)
  • -c, --columns PATTERN - Columns to consider for row-wise deduplication
  • --keep STRATEGY - Keep 'first' or 'last' occurrence (default: first)

Data Sampling & Transformation

nail sample

Sample data using various strategies.

# Random sampling
nail sample data.parquet -n 1000

# Reproducible random sampling
nail sample data.parquet -n 500 --method random --random 42

# Stratified sampling
nail sample data.parquet -n 1000 --method stratified --stratify-by category

# First N rows
nail sample data.parquet -n 100 --method first

# Last N rows
nail sample data.parquet -n 100 --method last

Options:

  • -n, --number N - Number of samples (default: 10)
  • --method METHOD - Sampling method: random, stratified, first, last (default: random)
  • --stratify-by COLUMN - Column name for stratified sampling
  • --random SEED - Random seed for reproducible results

nail shuffle

Randomly shuffle the order of rows in the dataset.

# Random shuffle
nail shuffle data.parquet

# Reproducible shuffle (Note: DataFusion's RANDOM() may not be deterministic)
nail shuffle data.parquet --random 42

# Shuffle and save to new file
nail shuffle data.parquet -o shuffled.parquet --verbose

Options:

  • --random SEED - Random seed for reproducible results

nail sort

Sort data by one or more columns with flexible sorting strategies and null handling.

# Sort by all columns (auto-detect data types)
nail sort data.parquet

# Sort by specific columns
nail sort data.parquet -c "price,date"

# Sort with specific strategies
nail sort data.parquet -c "date,amount,name" -s "date,numeric,alphabetic"

# Descending sort
nail sort data.parquet -c "revenue" -d true

# Multiple columns with mixed directions
nail sort data.parquet -c "category,price" -d "false,true"

# Handle nulls differently
nail sort data.parquet -c "score" --nulls first
nail sort data.parquet -c "rating" --nulls skip

# Case-insensitive alphabetic sorting
nail sort data.parquet -c "name,category" -s "alphabetic" --case-insensitive

# Sort dates with custom format
nail sort data.parquet -c "date" -s "date" --date-format "mm-dd-yyyy"

# Sort time values with custom format
nail sort data.parquet -c "timestamp" -s "hour" --hour-format "hh:mm:ss"

Options:

  • -c, --column COLUMNS - Columns to sort by (comma-separated or 'all') (default: all)
  • -s, --strategy STRATEGIES - Sort strategy per column: numeric, date, alphabetic, alphabetic-numeric, numeric-alphabetic, hour, auto (comma-separated)
  • -d, --descending FLAGS - Sort in descending order (comma-separated true/false per column)
  • --nulls HANDLING - Null value handling: first, last, skip (default: last)
  • --date-format FORMAT - Date format pattern (e.g., 'mm-dd-yyyy', 'dd/mm/yyyy', 'yyyy-mm-dd')
  • --hour-format FORMAT - Time format pattern (e.g., 'hh:mm:ss', 'mm:ss')
  • --case-insensitive - Case-insensitive alphabetic sorting

Sort Strategies:

  • auto: Automatically detect based on data type
  • numeric: Sort numerically (converts strings to numbers if needed)
  • date: Sort by date (supports custom formats)
  • alphabetic: Sort alphabetically (supports case-insensitive)
  • alphabetic-numeric: Sort alphabetically first, then numerically
  • numeric-alphabetic: Sort numerically first, then alphabetically
  • hour: Sort by time/hour values

nail id

Add a unique ID column to the dataset.

# Add simple numeric ID column
nail id data.parquet --create

# Add ID with custom name and prefix
nail id data.parquet --create --id-col-name record_id --prefix "REC-"

# Save with new ID column
nail id data.parquet --create -o data_with_ids.parquet

Options:

  • --create - Create new ID column
  • --prefix PREFIX - Prefix for ID values (default: "id")
  • --id-col-name NAME - ID column name (default: "id")

Data Combination

nail diff

Compare two datasets and show differences using key-based or row-based comparison.

# Compare files using key columns
nail diff old_data.parquet --compare new_data.parquet --keys "id,timestamp"

# Show only changed rows
nail diff v1.parquet --compare v2.parquet --keys "id" --changes-only

# Show rows only in left file
nail diff current.parquet --compare archive.parquet --keys "record_id" --left-only

# Show rows only in right file
nail diff baseline.parquet --compare updated.parquet --keys "user_id" --right-only

# Row-by-row positional comparison (no keys)
nail diff file1.csv --compare file2.csv

# Save diff results
nail diff old.parquet --compare new.parquet --keys "id" -o differences.parquet

Options:

  • -c, --compare FILE - Second file to compare with (required)
  • -k, --keys COLUMNS - Columns to use as primary key for comparison (comma-separated)
  • --changes-only - Show only rows that differ
  • --left-only - Show only rows in left file
  • --right-only - Show only rows in right file

Output:

  • diff_status column indicates: ADDED, REMOVED, or MODIFIED
  • For key-based: Shows matching records with left/right values
  • For row-based: Shows records by position with left/right values

nail merge

Join two datasets horizontally based on a common key column.

# Inner join (default)
nail merge left.parquet --right right.parquet --key id -o merged.parquet

# Left join - keep all records from left table
nail merge customers.parquet --right orders.parquet --left-join --key customer_id -o customer_orders.parquet

# Right join - keep all records from right table
nail merge orders.parquet --right customers.parquet --right-join --key customer_id -o order_customers.parquet

# Merge on columns with different names
nail merge table1.parquet --right table2.parquet --key-mapping "table1_id=table2_user_id"

Options:

  • --right FILE - Right table file to merge with (required)
  • --key COLUMN - Join key column name (if same in both tables)
  • --key-mapping MAPPING - Join key mapping for different column names (left_col=right_col)
  • --left-join - Perform left join
  • --right-join - Perform right join

nail append

Append multiple datasets vertically (union operation).

# Append files with matching schemas
nail append base.parquet --files "file1.parquet,file2.parquet" -o combined.parquet

# Append with verbose logging
nail append base.parquet --files "jan.parquet,feb.parquet,mar.parquet" --verbose -o q1_data.parquet

# Force append with schema differences (fills missing columns with nulls)
nail append base.parquet --files "different_schema.parquet" --ignore-schema -o combined.parquet

Options:

  • --files FILES - Comma-separated list of files to append (required)
  • --ignore-schema - Ignore schema mismatches and force append

nail split

Split dataset into multiple files based on ratios or stratification.

# Split by ratio
nail split data.parquet --ratio "0.7,0.3" --names "train,test" --output-dir splits/

# Stratified split
nail split data.parquet --ratio "0.8,0.2" --stratified-by category --output-dir splits/

# Reproducible split
nail split data.parquet --ratio "0.6,0.2,0.2" --random 42 --output-dir splits/

Options:

  • --ratio RATIOS - Comma-separated split ratios (must sum to 1.0 or 100.0)
  • --names NAMES - Comma-separated output file names
  • --output-dir DIR - Output directory for split files
  • --stratified-by COLUMN - Column for stratified splitting
  • --random SEED - Random seed for reproducible splits

File Optimization

nail optimize

Optimize Parquet files by applying compression, sorting, and encoding techniques.

# Basic optimization with default compression (snappy)
nail optimize input.parquet -o optimized.parquet

# Optimize with specific compression type and level
nail optimize input.parquet -o optimized.parquet --compression zstd --compression-level 5

# Sort data while optimizing
nail optimize input.parquet -o optimized.parquet --sort-by "timestamp,id"

# Enable dictionary encoding for better compression
nail optimize input.parquet -o optimized.parquet --dictionary

# Comprehensive optimization and validation
nail optimize input.parquet -o optimized.parquet --compression zstd --sort-by "date,category" --dictionary --validate --verbose

Options:

  • --compression TYPE - Compression type: snappy, gzip, zstd, brotli (default: snappy)
  • --compression-level LEVEL - Compression level (1-9, default: 6)
  • --sort-by COLUMNS - Comma-separated columns to sort by
  • --dictionary - Enable dictionary encoding
  • --no-dictionary - Disable dictionary encoding
  • --validate - Validate optimized file after creation

Data Analysis & Transformation

nail binning

Bin continuous variables into categorical ranges for analysis. Note: Current implementation supports one column at a time.

# Equal-width binning with 5 bins
nail binning data.parquet -c "age" -b 5 -o binned.parquet

# Custom bins with specific edges
nail binning data.parquet -c "score" -b "0,50,80,90,100" --method custom -o binned.parquet

# Add bin labels
nail binning data.parquet -c "temperature" -b 3 --labels "Cold,Warm,Hot" -o binned.parquet

Options:

  • -c, --columns COLUMN - Column to bin (required)
  • -b, --bins BINS - Number of bins or custom edges (e.g., "5" or "0,10,50,100") (default: 10)
  • --method METHOD - Binning method: equal-width, custom (default: equal-width)
  • --labels LABELS - Comma-separated bin labels
  • --suffix SUFFIX - Suffix for new binned column (default: "_binned")
  • --drop-original - Drop original column after binning

nail pivot

Create pivot tables for data aggregation and cross-tabulation. Note: Current implementation is a simplified group-by aggregation and does not create a wide-format pivot table.

# Basic pivot table with sum aggregation
nail pivot data.parquet -i "region" -c "product" -l "sales" -o pivot.parquet

# Pivot with different aggregation functions
nail pivot data.parquet -i "category" -c "product" -l "revenue" --agg mean -o avg_pivot.parquet

Options:

  • -i, --index COLUMNS - Row index columns (comma-separated) (required)
  • -c, --columns COLUMNS - Column pivot columns (comma-separated) (required)
  • -l, --values COLUMNS - Value columns to aggregate (comma-separated)
  • --agg FUNC - Aggregation function: sum, mean, count, min, max (default: sum)
  • --fill VALUE - Fill missing values (default: "0")

Format Conversion & Utility

nail convert

Convert between different file formats.

# Convert Parquet to CSV
nail convert data.parquet -o data.csv

# Convert CSV to Parquet
nail convert data.csv -o data.parquet

# Convert Parquet to Excel
nail convert data.parquet -o data.xlsx

# Verbose conversion with progress
nail convert large_dataset.csv -o large_dataset.parquet --verbose

Supported Formats:

  • Input: Parquet, CSV, JSON, Excel (xlsx)
  • Output: Parquet, CSV, JSON, Excel (xlsx)

nail count

Count the number of rows in a dataset.

# Basic row count
nail count data.parquet

# Count with verbose output
nail count data.parquet --verbose

nail update

Check for newer versions of the nail tool.

# Check for updates
nail update

# Check with verbose output
nail update --verbose

Examples

Basic Data Exploration

# Quick dataset overview with comprehensive file info
nail describe sales_data.parquet

# Traditional exploration
nail schema sales_data.parquet
nail size sales_data.parquet --columns --rows
nail head sales_data.parquet -n 10
nail stats sales_data.parquet --stats-type basic

# Advanced statistics with custom percentiles
nail stats sales_data.parquet -c "revenue,profit" --percentiles "0.25,0.5,0.75,0.90,0.95,0.99"

# Column inspection
nail headers sales_data.parquet --filter "price"
nail correlations sales_data.parquet -c "price,quantity,discount" --tests t_test

# Frequency analysis for categorical data
nail frequency sales_data.parquet -c "category,region,status"

Data Quality Investigation

# Comprehensive data quality overview
nail describe data.parquet

# Compare datasets to find differences
nail diff yesterday.parquet --compare today.parquet --keys "id" --changes-only

# Search for problematic values
nail search data.parquet --value "error" --ignore-case
nail search data.parquet --value "null" -c "critical_fields" --rows

# Find and remove duplicates
nail dedup data.parquet --row-wise -c "id" --verbose -o unique_data.parquet

# Analyze data size and memory usage
nail size data.parquet --columns --bits

# Check for specific patterns in text fields
nail search data.parquet --value "@gmail.com" -c "email" --exact

# Frequency analysis to identify data quality issues
nail frequency data.parquet -c "status" --verbose

Data Enhancement and Transformation

# Create new calculated columns
nail create sales_data.parquet --column "total=price*quantity,profit=(price-cost)" -o enhanced_sales.parquet

# Rename columns for clarity
nail rename enhanced_sales.parquet --column "total=total_sale,profit=net_profit" -o final_sales.parquet

# Create complex derived metrics
nail create customer_data.parquet --column "lifetime_value=orders*avg_order" -o customer_metrics.parquet

# Filter and enhance in one step
nail create large_dataset.parquet --column "score=performance*weight" --row-filter "active=true" -o active_scored.parquet

Data Optimization and Processing Pipeline

# 1. Optimize raw data files for better performance
nail optimize raw_data.parquet -o optimized_data.parquet --compression zstd --sort-by "timestamp,customer_id" --dictionary --verbose

# 2. Create analytical features with binning
nail binning optimized_data.parquet -c "age" -b "18,25,35,50,65" --method custom --labels "18-24,25-34,35-49,50-64,65+" -o aged_data.parquet

# 3. Group data for analysis (using pivot's group-by functionality)
nail pivot aged_data.parquet -i "age_binned" -c "category" -l "revenue" --agg sum -o age_revenue_summary.parquet

# 4. Add derived metrics
nail create age_revenue_summary.parquet --column "avg_revenue=sum_revenue/count_revenue" -o enhanced_summary.parquet

# 5. Statistical analysis of optimized data
nail stats enhanced_summary.parquet --stats-type exhaustive -o summary_stats.json

Performance Tips

  1. Use Parquet for large datasets - Parquet is columnar and much faster than CSV for analytical operations.
  2. Specify column patterns - Use -c with regex patterns to operate only on relevant columns.
  3. Chain operations - Use intermediate files for complex multi-step transformations (see the sketch after this list).
  4. Adjust parallelism - Use -j to control parallel processing based on your system.
  5. Enable verbose mode - Use --verbose to monitor performance and progress on large datasets.
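
A minimal sketch combining tips 2 through 5 (file and column names are placeholders):

# Keep only relevant columns, in parallel, with progress logging
nail select big.parquet -c "^(id|metric_).*" -j 8 --verbose -o metrics.parquet

# Chain steps through intermediate files
nail filter metrics.parquet -c "metric_a>0" -o filtered.parquet
nail stats filtered.parquet --stats-type exhaustive -o report.json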

Error Handling

nail provides detailed error messages for common issues:

  • File not found: Clear indication of missing input files
  • Schema mismatches: Detailed information about incompatible schemas in merge/append operations
  • Invalid expressions: Specific feedback on malformed filter conditions or column patterns
  • Memory issues: Graceful handling of large datasets with appropriate error messages
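
This makes nail easy to gate on in the shell-script and CI/CD workflows mentioned above. A minimal sketch, assuming nail exits with a nonzero status on failure (a standard CLI convention, not explicitly documented here):

# Validate the input before running the rest of a pipeline
if ! nail schema input.parquet > /dev/null; then
  echo "input.parquet failed schema inspection" >&2
  exit 1
fi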

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request
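
A typical flow with standard git and cargo commands (the fork URL and branch name are placeholders):

git clone https://github.com/<your-fork>/nail-parquet
cd nail-parquet
git checkout -b feature/my-change
cargo test                          # ensure all tests pass
git push origin feature/my-change   # then open a pull request on GitHub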

Support

For issues and questions, open an issue on the GitHub repository: https://github.com/Vitruves/nail-parquet/issues
