deequ

Data inspector

A library for testing data quality in large datasets

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
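
As a rough illustration of what a "unit test for data" can look like, the sketch below declares a few constraints on a small DataFrame and runs them with Deequ's `VerificationSuite`. It assumes Deequ and Spark are on the classpath. The `Item` case class, the sample rows, the check description, and the `runChecks` helper are hypothetical names introduced only for this example.

```scala
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataUnitTestSketch {
  // Hypothetical record type, used only for this illustration.
  case class Item(id: Long, productName: String, priority: String, numViews: Long)

  // Declares the constraints and evaluates them against the given data.
  def runChecks(data: DataFrame): VerificationResult =
    VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "basic item checks")
          .hasSize(_ == 3)                                  // expect exactly 3 rows
          .isComplete("id")                                 // id is never NULL
          .isUnique("id")                                   // id has no duplicates
          .isContainedIn("priority", Array("high", "low"))  // only these two values
          .isNonNegative("numViews"))                       // no negative view counts
      .run()

  def main(args: Array[String]): Unit = {
    // Local Spark session, an assumption made for a quick try-out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deequ-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val data = Seq(
      Item(1, "Thingy A", "high", 0),
      Item(2, "Thingy B", "low", 12),
      Item(3, "Thingy C", "high", 5)
    ).toDF()

    val result = runChecks(data)
    if (result.status == CheckStatus.Success) println("All constraints passed")
    else println("Some constraints failed")

    spark.stop()
  }
}
```

Each constraint is evaluated as Spark computations over the dataset, so the same check definition can in principle be pointed at a much larger table by passing a different DataFrame to `runChecks`.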

GitHub

3k stars
81 watching
539 forks
Language: Scala
last commit: about 1 month ago
Linked from 3 awesome lists

Tags: dataquality, scala, spark, unit-testing

Related projects:

Repository | Description | Stars
awslabs/python-deequ | A Python API for defining unit tests for data quality in large datasets | 730
databricks/koalas | A Python package that allows users to work with pandas DataFrames on top of Apache Spark | 3,336
dmmiller612/sparktorch | A PyTorch implementation on Apache Spark for distributed deep learning model training and inference | 339
databricks/learning-spark | Examples and tutorials for learning Spark using Java and Scala | 3,890
spiritlab/spark | A research-focused implementation of Apache Spark with homomorphic encryption support | 3
spark-notebook/spark-notebook | An interactive web-based editor for exploring and analyzing large datasets using Scala, Apache Spark, and other data science tools | 3,151
mrpowers-io/spark-fast-tests | A testing helper library for Apache Spark applications | 436
johnsnowlabs/spark-nlp | Provides a set of pre-trained models and libraries for natural language processing tasks on top of Apache Spark | 3,871
apache/spark | An analytics engine designed to handle large-scale data processing and analysis | 39,916
dotnet/spark | Provides high-performance APIs for using Apache Spark with .NET | 2,023
datastax/spark-cassandra-connector | A library that enables integration between Apache Spark and Apache Cassandra for fast data processing and analysis | 1,943
tofgarion/spark-by-example | An adaptation of ACSL by Example for SPARK 2014 to verify Ada programs with formal methods | 152
yaooqinn/itachi | A library that brings useful functions from various modern database management systems to Apache Spark | 56
databricks/spark-corenlp | Wraps Stanford CoreNLP annotators as Spark DataFrame functions for natural language processing tasks | 422
apache/jmeter | A tool used to simulate heavy loads on servers and measure their performance under different conditions | 8,413