Based on the standard active learning process (and assuming the `_generate_training_examples` function is intended to select `questions_per_cycle` examples from the unlabeled set to add to the training set):
Changing the value of `questions_per_cycle` significantly impacts the active learning process and, consequently, the accuracy curve over time and potentially the final accuracy achieved. Here's how:
1. **Rate of Training Data Growth:** `questions_per_cycle` determines how many new labeled instances are added to the training set in each active learning cycle (see the loop sketch after this list).
* A **larger `questions_per_cycle`** means the training set grows faster per cycle.
* A **smaller `questions_per_cycle`** means the training set grows more slowly per cycle.
2. **Granularity of Active Learning:** The size of `questions_per_cycle` affects how frequently the model is retrained and how granular the sample selection process is.
* A **larger `questions_per_cycle`** leads to fewer active learning cycles for a given pool of unlabeled data. The model is retrained less often, with larger batches of new data each time, so the acquisition function is recomputed against an up-to-date model state less frequently.
* A **smaller `questions_per_cycle`** leads to more active learning cycles. The model is retrained more often with smaller batches of new data, so the acquisition function can repeatedly leverage the current model state to select the *most* informative samples available at that moment, which can make exploration of the uncertain regions of the data space more efficient.
3. **Impact on Accuracy:**
* **Speed of initial convergence:** A **larger `questions_per_cycle`** generally produces faster accuracy gains in the first few cycles, because each cycle injects a larger influx of new labeled data. The model may reach a moderate level of accuracy quickly.
* **Potential for peak accuracy:** A **smaller `questions_per_cycle`** *can* potentially lead to a higher *final* or *peak* accuracy, especially if the acquisition function is effective. By querying fewer, potentially more impactful samples in each step, the active learning process can guide the model's learning more strategically. This fine-grained approach might help the model learn critical distinctions or cover diverse uncertain regions more effectively over many cycles, compared to adding a larger, potentially less curated batch.
* **Computational cost:** Retraining the model is often the most expensive part of an active learning cycle. A **larger `questions_per_cycle`** means fewer retraining steps are needed to consume the unlabeled pool, which can be computationally more efficient overall, despite training on larger datasets in each step. A **smaller `questions_per_cycle`** requires more retraining steps, increasing computational cost but potentially improving the efficiency of data acquisition itself (getting more "bang for your buck" in terms of information gain per labeled sample).
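To make the mechanics above concrete, here is a minimal sketch of an uncertainty-sampling active learning loop. It is illustrative only: `run_active_learning` and its whole signature are invented for this answer (the original `_generate_training_examples` is not shown), and least-confidence sampling with a logistic regression stands in for whatever acquisition function and model the real code uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_active_learning(X_pool, y_pool, X_init, y_init, X_test, y_test,
                        questions_per_cycle, n_cycles):
    """Hypothetical loop: least-confidence sampling, retrain every cycle."""
    X_train, y_train = X_init.copy(), y_init.copy()
    pool_idx = np.arange(len(X_pool))
    accuracies = []
    for _ in range(n_cycles):
        # Retrain on everything labeled so far; the training set grows by
        # questions_per_cycle examples per cycle (point 1).
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        accuracies.append(model.score(X_test, y_test))
        # Acquisition step (point 2): score the remaining pool with the
        # *current* model and take the questions_per_cycle least-confident points.
        proba = model.predict_proba(X_pool[pool_idx])
        uncertainty = 1.0 - proba.max(axis=1)
        picked = pool_idx[np.argsort(uncertainty)[-questions_per_cycle:]]
        # "Label" the picked examples (labels are already known in this toy setup).
        X_train = np.vstack([X_train, X_pool[picked]])
        y_train = np.concatenate([y_train, y_pool[picked]])
        pool_idx = np.setdiff1d(pool_idx, picked)
    # Final retrain so the last accuracy reflects the full labeling budget.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))
    return accuracies
```

A smaller `questions_per_cycle` simply means the `uncertainty` scores are refreshed against a newer model more often before each batch is chosen.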
In summary, `questions_per_cycle` represents a trade-off between the speed of reaching a certain accuracy level and the potential to achieve a higher maximum accuracy through more strategic, iterative sample selection.
* **Higher `questions_per_cycle`**: faster initial gains and fewer retraining steps, but less granular selection and potentially a lower peak accuracy.
* **Lower `questions_per_cycle`**: slower initial progress but more granular selection and potentially a higher peak accuracy (given an effective acquisition function), at a higher computational cost from more retraining steps (a rough cycle-count comparison follows below).
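As a rough illustration of the cost side of this trade-off, the sketch above can be run at an equal labeling budget with different batch sizes. The dataset and all numbers below are synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data: a 950-example pool, 50 seed labels, 200 held-out test points.
X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=200, random_state=0)
X_init, y_init = X_rest[:50], y_rest[:50]
X_pool, y_pool = X_rest[50:], y_rest[50:]

budget = 200  # total number of labels we are willing to pay for
for k in (10, 50):
    # Same budget either way: k=10 takes 20 retraining cycles, k=50 only 4.
    acc = run_active_learning(X_pool, y_pool, X_init, y_init, X_test, y_test,
                              questions_per_cycle=k, n_cycles=budget // k)
    print(f"questions_per_cycle={k}: {budget // k} cycles, "
          f"final accuracy {acc[-1]:.3f}")
```

Whether the smaller batch size actually wins on final accuracy depends on the model, the data, and the acquisition function; the point of the comparison is that it pays for any such gain with five times as many retraining steps.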