In the final instalment of our AI blog series, we look at big data projects and the benefits that our approach to data science can bring your business.
What happens when the volume of data grows?
Big data projects are usually developed on Amazon Web Services, Microsoft Azure or Google Cloud Platform. These cloud providers offer many options for adding intelligent features to enterprise applications without requiring deep knowledge of AI or ML theory or a team of data scientists. For customised ML business needs, however, ML expertise is essential, and the cost over time is significantly reduced when that expertise is available in-house.
For projects involving a large volume of data, you may consider cloud-enabled solutions from the first stage of a data science project. However, AI modelling can be time-consuming on large datasets, and training AI models with multiple configurations in the cloud can incur prohibitive costs.
By employing the Amdaris in-house data science accelerator, our AI specialists subsample the original data so that only a representative sample of the entire business dataset is used. This sampling approach lets the data scientist develop the AI models on local machines. Once the optimal parameters of the AI models have been determined, the final training can benefit from the full available data by running on cloud resources.
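As an illustrative sketch of this idea (the column name `segment` and the 10% sampling fraction are assumptions for the example, not details of the accelerator itself), a representative subsample can be drawn in pandas by sampling the same fraction within each group, so the sample preserves the proportions of the full dataset:

```python
import pandas as pd

def representative_sample(df: pd.DataFrame, by: str, frac: float = 0.1,
                          seed: int = 42) -> pd.DataFrame:
    """Draw a stratified subsample: the same fraction from each group,
    so group proportions in the sample match the full dataset."""
    return (df.groupby(by, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))

# Hypothetical business data: 1,000 rows across three customer segments
data = pd.DataFrame({
    "segment": ["A"] * 500 + ["B"] * 300 + ["C"] * 200,
    "value": range(1000),
})
sample = representative_sample(data, by="segment", frac=0.1)
```

Developing and tuning models against `sample` rather than `data` keeps experiments fast and cheap on a local machine, while the final cloud training still uses the full dataset.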
But does training the optimised models locally mean a new stage of code development for the cloud provider platform?
The answer is no: no extra effort has to be spent re-coding the AI models for the cloud platform, as the Amdaris in-house AI accelerator runs on the cloud without modifying the ML-related code. Furthermore, the same ETL code developed on a local machine can run on the cloud platform to process large quantities of data.
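To sketch what "same code in both environments" means in practice (the column names and cleaning rules below are illustrative assumptions), an ETL step written against plain pandas DataFrames has no Spark or Databricks dependency, so the identical function can run on a laptop or inside a Databricks Python task:

```python
import pandas as pd

def clean_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Illustrative ETL step: drop incomplete rows, normalise a text
    column and derive a total -- pure pandas, so it runs unchanged
    locally and on a Databricks cluster."""
    out = orders.dropna(subset=["quantity", "unit_price"]).copy()
    out["product"] = out["product"].str.strip().str.lower()
    out["total"] = out["quantity"] * out["unit_price"]
    return out

# Hypothetical raw input, as it might arrive from a source database
raw = pd.DataFrame({
    "product": ["  Widget", "Gadget ", "widget"],
    "quantity": [2, None, 3],
    "unit_price": [9.5, 4.0, 9.5],
})
cleaned = clean_orders(raw)
```

Because the function only depends on pandas, moving it to the cloud is a matter of where it is executed, not how it is written.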
How can Amdaris help?
We can help your data science projects by:
- reducing the time and cost of bringing AI pipelines to production-grade performance and stability on the cloud, by employing the in-house data science accelerator.
- allowing the same data science code to run on the cloud infrastructure.
There is no need for additional time to adapt the AI pipeline to the cloud. In particular, the framework tailored around the business problem runs in Microsoft Azure Databricks. Through this approach, the data science project combines the advantages of highly customised AI algorithms and cloud deployment without the high cost of cloud experts.
It is worth noting that cloud-based AI pipelines using Azure Databricks can scale automatically as data volume increases. In generic terms, a computing cluster is a set of connected nodes (computers) that work together to solve a task faster. When configuring computing resources for Azure Databricks, a cluster must be created and configured with the number of nodes and other parameters, as shown in the figure below.
Configuration of a computing cluster for Azure Databricks
The “Enable autoscaling” check-box is usually ticked when there are periods of high computational demand. The “Worker Type” dropdown selects the configuration of a single computing node. An important distinction when configuring clusters for Databricks is that between a node and a worker. A node is a general-purpose machine, while a worker is a special type of Spark entity called an executor. An executor is an in-memory process that acts as a fault-tolerant component in the Spark job-processing paradigm.
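The same settings shown in the configuration screen can also be expressed programmatically. As a sketch (the cluster name, node type, runtime version and worker counts are assumptions chosen for illustration), a request payload for the Databricks Clusters REST API with autoscaling enabled might look like:

```python
# Illustrative payload for the Databricks Clusters API (clusters/create).
# Field names follow the public API; the specific values are assumptions.
cluster_config = {
    "cluster_name": "data-science-accelerator",
    "spark_version": "13.3.x-scala2.12",   # Databricks runtime version
    "node_type_id": "Standard_DS3_v2",     # the "Worker Type" in the UI
    "autoscale": {                         # the "Enable autoscaling" check-box
        "min_workers": 1,
        "max_workers": 4,
    },
}
```

With `autoscale` set, Databricks grows the cluster towards `max_workers` during periods of high demand and shrinks it back towards `min_workers` afterwards, which is what keeps the running cost proportional to the workload.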
Note that Standard mode clusters require at least one Spark worker node in addition to the driver node to execute Spark jobs. Looking more closely at deploying a data science pipeline to production in the cloud, a first candidate solution is to use the Spark ecosystem, with all the complexity that it involves and the additional cost of configuring the cluster to be “Spark capable”, i.e. adding at least one Spark worker node.
But why not simply use the benefits of computing clusters without Spark for a more robust solution with a significantly reduced deployment and running cost?
The data science accelerator, built in Python, is the answer to this question. Deploying a Python-based processing pipeline on Databricks without using the Spark ecosystem is the key to a solid, efficient, cost-effective and reliable production-ready AI solution.
Proposed AI solution
The workflow of the proposed production-grade AI system is shown in the figure above. Azure Databricks is connected to various Azure resources for running the ETL, model training and inference independently. The tasks executed in Python by Azure Databricks can be scheduled to run automatically using a job configured inside Databricks, or they can be triggered by individuals in the user group with adequate privileges. The data is retrieved from an Azure SQL database and processed by the Python-based ETL, using the pandas version of the DataFrame structures, not the Spark version. The trained ML models are stored in an Azure Blob Storage container. Databricks connects to Azure Blob Storage and the SQL database using an Azure Key Vault, which stores the secrets the AI pipeline needs in order to run without human input.
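To keep the pipeline code runnable both on Databricks and on a local machine, secret lookup can be wrapped so it uses the Databricks secrets utility when available and falls back to environment variables elsewhere. This is a sketch under assumptions: the scope name `ai-pipeline` and the key name are hypothetical, and a Key Vault-backed secret scope must already be configured in the workspace.

```python
import os

def get_secret(scope: str, key: str) -> str:
    """Return a secret from the Databricks secrets utility when running
    on a cluster, or from an environment variable when running locally."""
    try:
        # dbutils is injected into notebooks/jobs by the Databricks runtime;
        # here the scope would be backed by Azure Key Vault.
        return dbutils.secrets.get(scope=scope, key=key)  # type: ignore[name-defined]
    except NameError:
        # Local development fallback: read the same secret from the environment.
        return os.environ[key]

# Hypothetical usage: build a SQL connection without hard-coding credentials.
# sql_password = get_secret("ai-pipeline", "SQL_PASSWORD")
```

With this wrapper, no plain-text secret ever appears in the repository, and the same ETL and training code runs unmodified in both environments.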
Benefits of using data science in production using the Amdaris data science framework via Azure Databricks
- Possibility of integrating with Azure Blob Storage and Azure Data Lake Store for retrieving, storing and updating data.
- The Databricks AI pipeline can be used as a component in the Azure Data Factory.
- Harvesting the benefits of keeping critical code parameters, such as server port numbers, inside Azure Key Vault to avoid including plain-text secrets in the code.
- It has an Azure Synapse Analytics connector, as well as the ability to connect to Azure SQL Database.
- For straightforward authentication, it integrates with the company’s Azure Active Directory.
- Maintaining code in a repository and spinning up clusters automatically using Azure DevOps and Continuous Integration (CI) / Continuous Delivery (CD).
Conclusion
For businesses across industries, focusing more on understanding the business decisions they need to influence, and less on the technology used, has a huge benefit in the medium to long term. Although migrating data and software to the cloud is a key part of enhanced efficiency, the reality for most organisations is that a hybrid solution will almost always provide the greatest returns to the business. In essence, adopting AI techniques for process optimisation developed directly on cloud technologies will increase the running and maintenance costs of these intelligent features.
Crafting a vision of customised AI solutions, developed by human specialists in collaboration with the organisation’s management, will create a solid foundation for adopting smart, automatic decision-making modules. The proposed solution of running a Python-based ML pipeline on cloud technologies offers a high degree of customisation of the solution the business needs, while eliminating the additional cost and overhead of relying solely on cloud technologies for AI-enabled features.
What next?
If you would like to speak to someone at Amdaris about AI solutions, just get in touch.
Call +44(0)117 935 3444 or contact us using the form below and let us know about your next plans. We will help you choose the best technology for making your project a success.