unstructured

Data pipeline library

A toolkit for building custom machine learning pipelines from unstructured data

Open-source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning.

GitHub

9k stars
60 watching
755 forks
Language: HTML
last commit: 6 days ago
Linked from 1 awesome list

Topics: data-pipelines, deep-learning, document-image-analysis, document-image-processing, document-parser, document-parsing, docx, donut, information-retrieval, langchain, llm, machine-learning, ml, natural-language-processing, nlp, ocr, pdf, pdf-to-json, pdf-to-text, preprocessing

Related projects:

Repository | Description | Stars
ml-tooling/opyrator | Automates conversion of machine learning code into production-ready microservices with web API and GUI. | 3,102
llmware-ai/llmware | A framework for building enterprise LLM-based applications using small, specialized models | 6,651
fastapi/fastapi | A modern Python framework for building high-performance RESTful APIs with automatic interactive documentation and robust standards-based features. | 77,670
gradio-app/gradio | Enables rapid creation and deployment of web applications for machine learning models and functions using Python | 33,962
lightly-ai/lightly | An open-source framework for self-supervised learning on images using deep learning techniques. | 3,165
towhee-io/towhee | A framework for building efficient neural data processing pipelines using large language models and state-of-the-art deep learning models. | 3,226
instructor-ai/instructor | A Python library that provides structured outputs from large language models (LLMs) and facilitates seamless integration with various LLM providers. | 8,163
pipedreamhq/pipedream | An integration platform for automating data flows between applications and services. | 8,981
explosion/spacy | Industrial-strength NLP library for Python and Cython | 30,230
pypi/warehouse | A software system that powers the package registry for Python packages | 3,601
gokumohandas/made-with-ml | Teaches machine learning fundamentals and software engineering practices for building production-ready ML applications | 37,603
juhaku/utoipa | Generates OpenAPI documentation from Rust API code | 2,474
mlflow/mlflow | A platform to manage the entire machine learning lifecycle, from experiment tracking to model deployment. | 18,781
kedro-org/kedro | A toolbox for production-ready data science pipelines with software engineering best practices for reproducibility and modularity | 10,004
alibaba/pipcook | An open-source machine learning platform for web developers, providing a modular framework for building and deploying models | 2,543