Machine learning competitions are a relatively new phenomenon that emerged alongside the development of artificial intelligence technologies. The field is growing rapidly and attracts many interested people.
The organizers of a competition gain:
– A large number of qualified people working on their problem and trying to solve it better than everyone else
– Relatively low financial costs (compared with hiring professionals)
– The solution that is highest in quality and best suited to the problem
The contestants benefit as well:
– Public recognition of high qualifications
– Cash prizes
– And just the pleasure of participating and winning
In this article, I want to look at several tools that can help participants organize their process better and more efficiently, increase their chances of winning, and become more qualified specialists overall.
Platform for training deep learning models.
– Accelerated model training using state-of-the-art distributed training, without changes to the model code
– Automatic search for high-quality models with advanced hyperparameter tuning, from the creators of Hyperband
– Smart scheduling of your GPUs and lower cloud GPU costs through the use of preemptible instances
– Track and reproduce experiments, including code versions, metrics, checkpoints, and hyperparameters
– Easy integration with popular DL frameworks
– Lets you spend more time building models and less time managing infrastructure
Machine learning tool for automated forecasting.
– Structuring prediction problems and generating labels for supervised learning
– Searching for training examples based on the desired outcome defined by a labeling function
– Passing the result to Featuretools for automated feature engineering
– Passing the result to EvalML for automated machine learning
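Here is a minimal sketch of this labeling workflow using the Compose library (composeml), which the text refers to alongside Featuretools and EvalML; the toy transaction data, column names, and labeling function are illustrative assumptions:

```python
import pandas as pd
import composeml as cp

# Toy transaction log (hypothetical data).
transactions_df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "transaction_time": pd.to_datetime(
        ["2021-01-01 00:10", "2021-01-01 00:40",
         "2021-01-01 00:05", "2021-01-01 00:50"]),
    "amount": [10.0, 25.0, 5.0, 7.5],
})

# Labeling function: the outcome we want to predict,
# computed over a time window of each customer's transactions.
def total_spent(df_slice):
    return df_slice["amount"].sum()

label_maker = cp.LabelMaker(
    target_dataframe_name="customer_id",  # called target_entity in older releases
    time_index="transaction_time",
    labeling_function=total_spent,
    window_size="1h",
)

# Scan the log and generate one label per customer per window.
labels = label_maker.search(
    transactions_df.sort_values("transaction_time"),
    num_examples_per_instance=1,
)
print(labels)
```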
Framework for automated feature engineering.
– Converting temporal and relational datasets into feature matrices
– Ability to automatically generate feature descriptions in English
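A hedged sketch of what this looks like with Featuretools: Deep Feature Synthesis over two related toy tables. The data and column names are assumptions, and older releases use entities/target_entity instead of dataframes/target_dataframe_name:

```python
import pandas as pd
import featuretools as ft

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "join_date": pd.to_datetime(["2020-01-01", "2020-06-01"]),
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [20.0, 35.0, 12.0],
    "order_time": pd.to_datetime(["2021-01-05", "2021-02-01", "2021-01-20"]),
})

# Deep Feature Synthesis over the relational structure.
feature_matrix, feature_defs = ft.dfs(
    dataframes={
        "customers": (customers, "customer_id"),
        "orders": (orders, "order_id", "order_time"),
    },
    relationships=[("customers", "customer_id", "orders", "customer_id")],
    target_dataframe_name="customers",
)
print(feature_matrix.head())

# English-language description of a generated feature.
print(ft.describe_feature(feature_defs[0]))
```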
AutoML library for creating, optimizing, and evaluating machine learning pipelines using domain-specific objective functions.
– In combination with Featuretools and Compose, you can create end-to-end ML solutions for supervised learning
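A minimal sketch of an end-to-end EvalML search, using a demo dataset bundled with the library:

```python
import evalml
from evalml import AutoMLSearch

# Demo data shipped with the library.
X, y = evalml.demos.load_breast_cancer()
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(
    X, y, problem_type="binary")

automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
automl.search()

# The best pipeline found, scored with a domain-relevant objective.
best = automl.best_pipeline
print(best.score(X_test, y_test, objectives=["f1"]))
```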
Creates profile reports from a pandas DataFrame.
– Instead of df.describe(), use the df.profile_report() function
– Quick data analysis
– Interactive HTML report with per-column statistics
– Type inference: detecting the types of columns
– Basics: type, unique values, missing values
– Quantile statistics: minimum, Q1, median, Q3, maximum, range, interquartile range
– Descriptive statistics: mean, mode, standard deviation, sum, mean absolute deviation, coefficient of variation, kurtosis, skewness
– Most frequent values
– Histogram
– Correlations of strongly dependent variables: Spearman, Pearson, and Kendall matrices
– Missing values: count, matrix, heatmap, and dendrogram
– Text analysis: character categories (uppercase letters, spaces), scripts (Latin, Cyrillic), and blocks (ASCII) in text data
– File and image analysis: file sizes, creation dates, truncated images, and images containing EXIF data
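A minimal usage sketch, assuming a hypothetical train.csv with competition data:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("train.csv")  # hypothetical competition data

# One call replaces df.describe() with a full interactive report.
profile = ProfileReport(df, title="Train data overview")
profile.to_file("report.html")

# Or, once pandas_profiling is imported, directly from the DataFrame:
# df.profile_report()
```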
Machine learning tool that optimizes pipelines using genetic programming.
– Automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data
– After the search is complete, provides the Python code for the best pipeline it found
– Built on top of scikit-learn
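The description matches TPOT; assuming that, a minimal sketch on a scikit-learn demo dataset looks like this:

```python
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Genetic programming over thousands of candidate pipelines.
tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Exports the best pipeline as plain scikit-learn code.
tpot.export("best_pipeline.py")
```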
Game-theoretic approach to explaining the results of any ML model.
– Has an exact algorithm for tree ensembles
– Can also be applied to deep learning models
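The description matches the SHAP library; assuming that, here is a hedged sketch with a tree ensemble (the XGBoost model is an illustrative choice):

```python
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgboost.XGBClassifier().fit(X, y)

# TreeExplainer implements the exact algorithm for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view of feature impact across the dataset.
shap.summary_plot(shap_values, X)
```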
Library with multiple feature transformers for use in ML models.
– Lets you select exactly which variables you want to transform
– Transformers for missing data, categorical variables, sampling, variable transformations, outliers, creating and selecting variables
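The description matches Feature-engine; assuming that, a minimal sketch of per-variable imputation (the toy DataFrame is an assumption, and module paths differ across releases):

```python
import pandas as pd
from feature_engine.imputation import MeanMedianImputer  # older releases use a different module path

X = pd.DataFrame({"age": [25, None, 40], "fare": [7.25, 71.28, None]})

# Unlike plain scikit-learn imputers, you pick the variables explicitly
# and get a DataFrame (not a bare array) back.
imputer = MeanMedianImputer(imputation_method="median", variables=["age", "fare"])
X_t = imputer.fit_transform(X)
print(X_t)
```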
Library for semi-automated data science: selecting algorithms and tuning hyperparameters.
– Improves automation, correctness checking, and interoperability
– For automation — a high-level interface to pipeline search tools
– For correctness — JSON schemas that catch mismatches between a hyperparameter and its type, or between data and an operator
– For interoperability — a growing library of transformers and estimators from popular libraries
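The description matches IBM's Lale; assuming that, a hedged sketch of its combinator syntax (">>" pipes operators, "|" declares an algorithmic choice) with automatic configuration:

```python
from sklearn.datasets import load_iris
from lale.lib.sklearn import PCA, LogisticRegression, RandomForestClassifier
from lale.lib.lale import Hyperopt

X, y = load_iris(return_X_y=True)

# Hyperparameters are left free for the optimizer to configure.
pipeline = PCA >> (LogisticRegression | RandomForestClassifier)
trained = pipeline.auto_configure(X, y, optimizer=Hyperopt, cv=3, max_evals=10)
print(trained.predict(X)[:5])
```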
Tool for working with unstructured data.
– Automatic classification: short and noisy texts as well as long texts; tools for monitoring and analyzing classification results; an easy-to-use annotation interface; pre-configured and extensible classifiers
– Data extraction: tabular data and long documents; built-in ready-made entities (date, time, quantity, weight, size, units of measurement); support for multiple file formats (PDF, Word, Excel, HTML, e-mail, or plain text); customizable entities, attributes, and relationships; relational output of entities, relationships, roles, and attributes based on knowledge graphs
– Comparison: customizable semantic similarity services for sentences, paragraphs, and text content in databases; analytical interfaces for finding the most similar and dissimilar elements
Tool for probabilistic data structures.
– Process and search large amounts of data very quickly
– Very small loss of precision
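The text doesn't name the tool, so rather than guess its API, here is a pure-Python sketch of the core idea behind one such structure, the Bloom filter: constant-memory membership tests with no false negatives and a tunable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: answers 'maybe present' or 'definitely absent'."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("user_42")
print("user_42" in bf)   # True
print("user_999" in bf)  # almost certainly False
```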
Tool for working with text.
– Extracting the top-ranked phrases from text documents
– Performing low-cost extractive summarization of text documents
– Inferring links from unstructured text into structured data
– Support for entity linking
– Graph algorithms (in particular, eigenvector centrality)
– Building a lemma graph to represent links between candidate phrases and their supporting language
– Inclusion of verbs in the graph (but not in the resulting phrases)
– Preprocessing via noun chunking and named entity recognition
– Extractive summarization based on the ranked phrases
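The feature list matches PyTextRank; assuming that (and pytextrank 3.x with spaCy 3), a minimal sketch:

```python
import spacy
import pytextrank  # registers the "textrank" spaCy component (pytextrank >= 3)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp("Machine learning competitions attract thousands of participants. "
          "Good tooling helps participants iterate on models faster.")

# Top-ranked phrases from the lemma graph.
for phrase in doc._.phrases[:5]:
    print(round(phrase.rank, 3), phrase.text)
```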
Set of tools for easy pipeline creation.
– Simple parallel computing
– Transparent function caching and lazy re-evaluation
– Optimized for fast and reliable processing of large data and arrays
– Convenient restarting of experiments
– Separation of execution-flow logic from domain logic and code
– Parallel helper — makes it easier to write readable parallel code and debug it
– A pickle replacement for working with objects containing large data
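The feature list matches joblib; assuming that, a hedged sketch of the three main pieces:

```python
from math import sqrt
from joblib import Parallel, delayed, Memory, dump, load

# Parallel helper: readable parallel loops.
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))

# Transparent caching: repeated runs skip already-computed calls.
memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def expensive_feature(n):
    return sum(i * i for i in range(n))

print(expensive_feature(10**6))  # computed once, then read from cache

# Pickle replacement, efficient for objects holding large numpy arrays.
dump(results, "results.joblib")
print(load("results.joblib")[:3])
```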
Structure-aware preconditioning algorithm for stochastic optimization.
– Converges faster in practice than other optimizers
– Maintains a set of preconditioning matrices, each of which operates on a single dimension while contracting over the others
– Comes with convergence guarantees in the stochastic convex setting
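The description matches structure-aware preconditioners in the spirit of Shampoo. As an illustration of the idea only (not the authors' implementation), here is one simplified update step for a matrix-shaped parameter, where the left and right preconditioners each act on a single dimension:

```python
import numpy as np

def inv_fourth_root(M, eps=1e-4):
    """M^(-1/4) for a symmetric PSD matrix, via eigendecomposition."""
    w, V = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return (V * np.maximum(w, eps) ** -0.25) @ V.T

def shampoo_step(W, G, L, R, lr=0.1):
    """One simplified step for a matrix parameter W with gradient G."""
    L += G @ G.T                     # statistics for the row dimension
    R += G.T @ G                     # statistics for the column dimension
    precond = inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W - lr * precond, L, R

m, n = 4, 3
W = np.zeros((m, n))
L, R = np.eye(m), np.eye(n)
G = np.random.randn(m, n)            # the gradient would come from the loss
W, L, R = shampoo_step(W, G, L, R)
```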
Uber’s machine learning platform.
– Support for a continuous, end-to-end workflow
– Centralized feature storage
– Distributed learning infrastructure
– Evaluation and visualization of models with decision trees
– Model deployment tools
– Prediction and routing
– API for connecting pipelines
Tool for labeling images.
– Fast data labeling
– Automation of the labeling process
– Training a helper model directly during labeling
– Search for possible errors
Tool for serving models under large-scale workloads.
– Deploy models as real-time or batch APIs
– High availability with availability zones and automatic instance restarts
– Inference on on-demand or spot instances, with on-demand instances as a fallback
– Autoscaling to handle production workloads, with support for request overprovisioning
Set of tools for machine learning.
– Tracking experiments
– Hyperparameter optimization
– Versioning of models and datasets
– Dashboard — view experiments in real time
– Optimizing models with a scalable hyperparameter search tool
– Artifact tracking — save all the details of an end-to-end pipeline
– Collaborative reports — explore results and share conclusions
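This feature set matches Weights & Biases (also referenced in the next section); assuming that, a minimal tracking sketch with the wandb client, where the project name and metrics are illustrative:

```python
import wandb

run = wandb.init(project="ml-competition", config={"lr": 1e-3, "epochs": 3})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)           # stand-in for a real metric
    wandb.log({"epoch": epoch, "train_loss": train_loss})

run.finish()
```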
Set of tools for deploying and managing ML experiments.
– Read configuration files and manage experiment directories
– Logging in Weights & Biases
– Setting up and running the hyperparameters using Weights & Biases
– Writing text or images to a file; progress indicators
– Converting matplotlib figures to images
– Visualization of multidimensional images
– Waiting for running processes to finish and resources to be released
Working with data — testing, documenting, and profiling.
– Automatic data documentation
– Generating documentation from tests
– Automatic data profiling
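The description matches Great Expectations; assuming that, a hedged sketch using its classic notebook-style API (the file and column names are assumptions):

```python
import great_expectations as ge

# The classic API: a pandas DataFrame subclass with expect_* methods.
df = ge.read_csv("train.csv")  # hypothetical file

df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

# Validation results double as living documentation of the data.
print(df.validate())
```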
Platform for optimizing hyperparameters.
– Defining the search space
– Search for the best values
– Built-in Bayesian optimization algorithms
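The text doesn't name the platform; Optuna is one library that fits this description, so here is an illustrative sketch with it:

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(trial):
    # The search space is defined inline, per trial.
    c = trial.suggest_float("C", 1e-3, 1e2, log=True)
    model = LogisticRegression(C=c, max_iter=1000)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```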
Desktop application for AI libraries, designed for developers of embedded applications and MCU C code.
– Search for the best libraries for embedded projects
– Enabling machine learning capabilities in MCU C code
– Libraries run on any Arm Cortex-M microcontroller and are optimized for it
– Very small model memory footprint (1–20 kB RAM / flash)
– Ultra-fast models (1–20 ms inference on an 80 MHz Cortex-M4)
– Automatic data quality check
– Automatic search for the best AI model
– Real-time data collection and import via serial port
– Emulator for testing the library before embedding
– Easy deployment of C libraries
– Models can be trained directly, without using the MCU
– No ML experience or expertise is required to create and deploy models.
End-to-end platform for creating and managing high-quality data.
– Automated labeling
– Shared workspace for working with data and collective interaction between internal and external teams
– Track activity and work progress
– Access and role management
– API (Python, GraphQL) and SDK
– Working with images: classification, recognition, and segmentation
– Working with video: a powerful video editor, frame-level labels on video at up to 30 FPS, label feature analysis
– Working with text: classification, named entity recognition, support for complex ontologies with built-in classifications
– Model-based pre-labeling and active learning
– Prioritizing the labeling queue via the API so that the most important data is labeled first
Organization of ML experiments and monitoring of the learning process from a mobile device.
– Easy integration (2 lines of code)
– Storing the experiment log, including git commits, settings, and hyperparameters
– Storing the TensorBoard log
– Control panel in the local browser
– Storage of checkpoints
– API for custom visualization
Low-code ML library.
– Fast process — from data preparation to model deployment
– Focus on business tasks instead of coding
– Easy to use for building a complete experiment pipeline
– Model performance analysis (more than 60 plots)
– Data preparation (missing values, transforming categorical data, creating features, configuring hyperparameters of the model)
– Support for the Boruta algorithm
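The description matches PyCaret; assuming that, a minimal sketch with a sample dataset bundled with the library:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, finalize_model, save_model

data = get_data("juice")  # sample dataset bundled with the library

s = setup(data=data, target="Purchase", session_id=42)
best = compare_models()          # trains and ranks many models at once
final = finalize_model(best)     # refits the winner on the full dataset
save_model(final, "best_model")
```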
Tool for quickly creating models.
– Track, compare, explain, and optimize experiments and models
– Fast integration
– Comparison of experiments — code, hyperparameters, metrics, predictions, dependencies, system metrics
– Debugging models — view, analyze, get information and visualize data
– Workspace for team interaction
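The description matches Comet; assuming that, a hedged sketch with the comet_ml client (an API key is expected in the environment; the metrics are illustrative):

```python
from comet_ml import Experiment

# Assumes COMET_API_KEY is set in the environment.
experiment = Experiment(project_name="ml-competition")

experiment.log_parameters({"lr": 1e-3, "batch_size": 64})
for step in range(3):
    experiment.log_metric("train_loss", 1.0 / (step + 1), step=step)

experiment.end()
```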
Solution for combining ML tools (MLOps).
– One set of tools for automating the preparation, execution, and analysis of experiments
– Experiment management — parameters, tasks, artifacts, metrics, debugging data, metadata, and logs
– Management and orchestration of GPU / CPU resources, automatic scaling on cloud and on-premises machines
– Data management — dataset versioning and analysis; creating and automating data pipelines; rebalancing, mixing, and combining datasets
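The description matches ClearML; assuming that, a minimal tracking sketch (project and metric names are illustrative):

```python
from clearml import Task

# One call registers the run; code, packages, and console output are captured.
task = Task.init(project_name="ml-competition", task_name="baseline")

params = task.connect({"lr": 1e-3, "epochs": 5})  # hyperparameters appear in the UI

logger = task.get_logger()
for epoch in range(params["epochs"]):
    logger.report_scalar("loss", "train", value=1.0 / (epoch + 1), iteration=epoch)

task.close()
```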
Creates comfort, convenience, pleasantness, and warmth, and promotes creative inspiration.
– Room with a pleasant atmosphere
– Classical music
– Great mood
Conclusion
Of course, a description of the tools alone is not enough to win every time.
Success depends on many other factors: knowing where and when to use (or not use) a particular tool, what its limitations are, how to combine tools, and so on.
Nevertheless, I hope this article will be useful to you and that your participation in competitions will become more fruitful and effective.