- Introduction
- Architecture Overview
- Technologies Used
- Project Structure
- Development Pipeline
- Setup and Execution
- Testing
- License
This project introduces an End-to-End MLOps pipeline for building a machine learning solution focused on crop prediction. The core objective is to recommend the most suitable crop type based on soil parameters, enabling better agricultural decision-making.
The emphasis of this project is on MLOps best practices: automated machine learning workflows, reproducible pipelines, model version control, and deployment-ready APIs.
- Build a fully automated pipeline covering data processing, model training, and deployment.
- Deploy a RESTful API to make model predictions available in a containerized environment on AWS.
Note: The dataset is intentionally simple. The focus of this project is not model performance or complex data, but the design of a robust, well-structured architecture that follows best practices across the entire machine learning lifecycle.
The project architecture is modular to ensure scalability and maintainability. The following components are included in the pipeline:
- **Data Exploration and Processing:** Initial exploration of the soil dataset, data cleaning, and preprocessing.
- **Feature Engineering:** Numerical features are scaled and categorical variables are encoded after the exploration stage.
- **Model Training and Experiment Tracking:** The model is trained using the PyCaret benchmark as a base, and experiments are tracked with MLFlow.
- **Model Deployment:** The trained model is served as a RESTful API, allowing external services to request predictions.
- **Continuous Integration/Deployment:** GitHub Actions runs tests and deploys the services.
Below is the general solution diagram that represents the system flow:
- Scikit-Learn
- PyCaret
- MLFlow
- FastAPI
- Uvicorn
- Docker
- Docker Compose
- Pandas
- Seaborn
- Pydantic
- GitHub Actions
- Pytest
- Pylint
- isort
- Black
The project is organized as follows:
end-to-end-mlops/
│
├── notebooks/
│ └── Data exploration and experimentation notebooks.
│
├── data/
│ └── Datasets used for training and testing.
│
├── config/
│ └── Configuration files for the pipeline (e.g., YAML, JSON).
│
├── model/
│ └── Trained models, saved checkpoints, and model artifacts.
│
├── AI_reports/
│ └── Reports created by the AI agent.
│
├── src/
│ ├── Preprocessing scripts.
│ └── Training pipeline.
│
├── tests/
│ └── Unit and integration test scripts.
│
├── Dockerfile
│ └── Instructions to containerize the API and training scripts.
│
├── compose.yaml
│ └── Docker Compose file that orchestrates services (API, training, MLFlow).
│
├── app.py
│ └── FastAPI script serving trained models as a REST API.
│
├── pyproject.toml
│ └── Lists project dependencies and environment configuration.
│
└── README.md
└── Overview of the project (current file).
The development lifecycle is divided into clear stages as follows:
- Exploration: The raw soil dataset is analyzed and visualized to understand its features.
- Preprocessing: Numerical features are standardized with Z-score normalization, and categorical variables are encoded using LabelEncoder, as sketched below.
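A minimal sketch of this preprocessing step, assuming a pandas DataFrame loaded from a CSV with a `label` target column (the file path and column name are illustrative, not taken from the repository):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the raw soil dataset (path is illustrative).
df = pd.read_csv("data/soil.csv")

# Z-score normalization for the numerical soil parameters.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# Encode the categorical crop label as integers.
df["label"] = LabelEncoder().fit_transform(df["label"])
```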
- PyCaret is used to rapidly benchmark candidate models and evaluate their performance metrics.
- The best model from the PyCaret benchmark is used as the base for training.
- MLFlow Autologging is enabled to log all hyperparameters, training results, and performance metrics for every experiment.
- Manual logs are added for preprocessing steps and specific configurations (see the sketch below).
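A minimal sketch of the benchmarking and tracking flow, assuming PyCaret's classification module and its built-in MLFlow integration (the experiment name, file path, and target column are illustrative):

```python
import mlflow
import pandas as pd
from pycaret.classification import compare_models, finalize_model, setup

# Preprocessed dataset from the previous stage (path is illustrative).
df = pd.read_csv("data/soil_preprocessed.csv")

# Configure the PyCaret experiment; log_experiment forwards runs to MLFlow.
setup(data=df, target="label", log_experiment=True, experiment_name="crop-prediction")

# Benchmark candidate models and keep the best performer.
best = compare_models()

# Manually log preprocessing context alongside the autologged runs.
with mlflow.start_run(run_name="preprocessing-context"):
    mlflow.log_param("scaling", "z-score")

# Retrain the winning model on the full dataset.
final_model = finalize_model(best)
```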
- The trained model is deployed as a RESTful API using FastAPI.
- The API exposes endpoints for making predictions with the trained model (see the sketch after the note below).
- All services are containerized with Docker for easy distribution.
- Deployment to AWS EC2 is automated with GitHub Actions.
Note: If you want to deploy to AWS, first set the EC2_SSH_KEY, REMOTE_HOST, and REMOTE_USER secrets in GitHub Actions before triggering the workflow; otherwise, it will fail.
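A minimal sketch of the serving layer, assuming a scikit-learn-compatible model artifact under model/ and illustrative soil feature names (the actual app.py schema may differ):

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Crop Prediction API")

# Load the trained model artifact at startup (path is illustrative).
model = joblib.load("model/model.pkl")

class SoilParameters(BaseModel):
    # Field names are illustrative soil measurements.
    nitrogen: float
    phosphorus: float
    potassium: float
    ph: float

@app.post("/predict")
def predict(params: SoilParameters) -> dict:
    # Build a single-row frame in the same feature order used for training.
    features = pd.DataFrame([params.model_dump()])
    prediction = model.predict(features)[0]
    return {"crop": str(prediction)}
```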
- Python 3.8+
- Docker and Docker Compose
- Clone the repository:

  ```bash
  git clone https://github.com/Hotarouuu/end-to-end-mlops.git
  cd end-to-end-mlops
  ```

- Build and start the containers:

  ```bash
  docker compose up --build
  ```

- Access the API documentation: visit http://localhost:8000/docs, where you can test the endpoints.
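Once the services are running, predictions can also be requested programmatically; a sketch assuming the illustrative /predict endpoint and field names from the deployment example above:

```python
import requests

# Illustrative soil parameters matching the hypothetical request schema.
payload = {"nitrogen": 90.0, "phosphorus": 42.0, "potassium": 43.0, "ph": 6.5}
response = requests.post("http://localhost:8000/predict", json=payload)
print(response.json())  # e.g., {"crop": "rice"}
```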
Automated testing ensures the pipeline functions reliably. The following testing layers are included:
- **Unit Tests:** Validate individual components such as preprocessing, feature engineering, and model inference functions.
- **Integration Tests:** Verify the interaction between multiple modules, including end-to-end pipeline workflows.
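As an illustration, a unit test for the API might use FastAPI's TestClient (assuming the illustrative /predict schema sketched above; the actual tests in tests/ may differ):

```python
from fastapi.testclient import TestClient

from app import app  # the FastAPI application defined in app.py

client = TestClient(app)

def test_predict_returns_a_crop():
    # Illustrative payload matching the hypothetical request schema.
    payload = {"nitrogen": 90.0, "phosphorus": 42.0, "potassium": 43.0, "ph": 6.5}
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    assert "crop" in response.json()
```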
To execute all tests, run the following:

```bash
pytest tests/
```

This project is licensed under the MIT License. See the LICENSE file for more details.
