Data is the most critical ingredient in training AI algorithms. As the saying "garbage in, garbage out" suggests, data quality has to be controlled from the very beginning. It is therefore not surprising that data gathering, data cleaning and data labeling are often estimated to take up about 80% of the time spent on developing a new AI solution.
Long story short, the more good-quality data you can offer, the better the project's chances of success. Large amounts of data are key to AI. Research suggests that on small datasets, traditional forecasting methods such as multiple regression perform about as well as AI solutions. Once big data comes into play, however, neural networks and other AI methods can deliver a competitive advantage.
But where do you find data? Combine internal and external data sources. When gathering your data, don't forget free sources such as:
- Open datasets prepared by enthusiasts
- Google search and data parsing
- Cooperation with laboratories or industry organizations that might see potential in partnering with you
- Web scraping
- National and international institutions (e.g. governments)
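Combining internal and external sources usually comes down to joining records on a shared key. The following is a minimal sketch in Python, not a prescribed method: the field names (`customer_id`, `region`, `revenue`, `median_income`), the helper `enrich` and all values are hypothetical and only illustrate the idea.

```python
# Hypothetical internal records, e.g. exported from your own CRM.
internal_sales = [
    {"customer_id": "c1", "region": "north", "revenue": 1200.0},
    {"customer_id": "c2", "region": "south", "revenue": 800.0},
    {"customer_id": "c3", "region": "west", "revenue": 450.0},
]

# Hypothetical external data, e.g. taken from a public statistics office.
external_income = {"north": 41000, "south": 38000}


def enrich(records, lookup, key, new_field):
    """Return copies of the records with one extra field from an external lookup."""
    enriched = []
    for record in records:
        row = dict(record)  # copy, so the internal data stays untouched
        row[new_field] = lookup.get(record[key])  # None if the external source has no match
        enriched.append(row)
    return enriched


combined = enrich(internal_sales, external_income, "region", "median_income")
for row in combined:
    print(row)
```

Records without a match in the external source keep a missing value here, which makes gaps in coverage visible during data cleaning instead of hiding them.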
To prepare a dataset for machine learning, you need to label the data, in other words provide the correct context so that the AI model can learn from it. Data labeling procedures depend on the AI use case and can vary significantly. Three examples illustrate this:
- If you intend to use AI to distinguish between dogs and cats in photos, be ready to gather a large set of images with different backgrounds, lighting conditions and angles, and to indicate for each image whether it contains a dog or a cat (that is, to label it).
- If you want to detect the presence of a tumor in MRI scans, first evaluate where you can access a sufficiently large number of scans. How many scans you need depends on various questions, such as whether the algorithm should also mark the tumor's location on the scan. Then label the scans using specialized software. From the labeled scans, the algorithm ultimately learns what a tumor looks like and can identify tumors in new scans.
- Adding AI functionality to the scoring of a consumer's creditworthiness requires a database of borrowers labeled according to whether the debt has already been paid off or the customer is in arrears.
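The credit scoring example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive labeling scheme: the `Loan` fields, the label names `paid_off` and `in_arrears`, the cut-off date and the rule that loans not yet due get no label are all hypothetical choices.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Loan:
    borrower_id: str
    due_date: date
    paid_on: Optional[date]  # None means the debt is still open


def label_loan(loan: Loan, as_of: date) -> Optional[str]:
    """Assign a training label to one loan, or None if the outcome is not known yet."""
    if loan.paid_on is not None:
        return "paid_off"
    if as_of > loan.due_date:
        return "in_arrears"
    return None  # assumption: loans that are not yet due are left unlabeled


# Hypothetical borrower records.
loans = [
    Loan("b1", due_date=date(2023, 6, 1), paid_on=date(2023, 5, 20)),
    Loan("b2", due_date=date(2023, 9, 1), paid_on=None),
    Loan("b3", due_date=date(2024, 6, 1), paid_on=None),
]

AS_OF = date(2024, 1, 1)  # assumed date on which the labels are computed
labeled = {loan.borrower_id: label_loan(loan, AS_OF) for loan in loans}

# Only records with a known outcome would go into the training set.
training_set = {bid: label for bid, label in labeled.items() if label is not None}
print(training_set)
```

In a real project the labeling rule would come from the business definition of default, for example a minimum number of days overdue, rather than from a simple date comparison.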
You can complete these tasks yourself or hire a data labeling quality specialist or a data scientist to help ensure the success of your project.