So we finally have some news from Apple regarding their machine learning and LLM efforts, in terms of what they've actually been developing. Apple has introduced a multimodal AI system that is pretty impressive because it genuinely exceeds GPT-4's capabilities in some regards. This might be the scenario many have been anticipating when they say GPT-4 is no longer the king. Let's take a look at exactly what Apple has introduced and how good this new multimodal AI system really is.

The system is called Ferret, and it comes from research by Apple's own researchers. Essentially, it's primarily a vision model, and it works in a few stages. First, it uses an image encoder called CLIP-ViT-L/14 to understand what's in the picture and turn it into a form the model can work with. Second, it takes the words you give it and converts them into a format it can understand. Then it identifies areas in the image: if you talk about a specific part of the picture, like a cat in the bottom left-hand corner, the model uses spatial coordinates to find exactly where that is. It's also really smart about dealing with different shapes in the picture, not just simple boxes: it samples many points in the area you're talking about and captures the features and location of each point (a sketch of this sampling idea follows below). Finally, it brings all of this information together to accurately find and describe the specific part of the picture you're talking about.
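To make that free-form region idea concrete, here is a minimal, conceptual sketch in Python of how sampling points inside an arbitrary region might work. This is not Apple's code: the class name SpatialRegionSampler, the feature dimensions, and the point count are all illustrative assumptions.

```python
# A minimal, conceptual sketch of free-form region sampling (not Apple's code).
import torch
import torch.nn as nn

class SpatialRegionSampler(nn.Module):
    """Pools features from points inside a referred region, so arbitrary
    shapes (not just boxes) can be turned into one region embedding."""
    def __init__(self, feat_dim: int, num_points: int = 32):
        super().__init__()
        self.num_points = num_points
        # Fuse each sampled point's visual features with its (x, y) position.
        self.proj = nn.Linear(feat_dim + 2, feat_dim)

    def forward(self, feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat_map: (C, H, W) image features; mask: (H, W) boolean region mask.
        ys, xs = torch.nonzero(mask, as_tuple=True)
        keep = torch.randperm(len(xs))[: self.num_points]  # random points in region
        ys, xs = ys[keep], xs[keep]
        pts = feat_map[:, ys, xs].T                        # (num_points, C)
        coords = torch.stack([xs, ys], dim=1).float()
        coords = coords / torch.tensor(                    # normalize to [0, 1]
            [mask.shape[1], mask.shape[0]], dtype=torch.float
        )
        fused = self.proj(torch.cat([pts, coords], dim=1))
        return fused.mean(dim=0)                           # one region embedding

# Stand-in for a CLIP-ViT-L/14 feature map: 1024 channels on a 16x16 grid.
feat_map = torch.randn(1024, 16, 16)
mask = torch.zeros(16, 16, dtype=torch.bool)
mask[10:16, 0:6] = True  # a free-form region in the bottom-left ("the cat")

region_embedding = SpatialRegionSampler(feat_dim=1024)(feat_map, mask)
print(region_embedding.shape)  # torch.Size([1024])
```

The key design point here is that mixing each sampled point's features with its normalized coordinates lets the model reason about where things are inside the region, not just what they look like.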
Essentially, what we have here is a really impressive, advanced image-understanding model that, on certain benchmarks, exceeds GPT-4's vision capabilities. First of all, there are some benchmarks you may want to look at. On the benchmarks for the Ferret model, we can see that Ferret supports all of the referring input types: point, box, and free-form shape (see the sketch below for how these might be represented). It also has very good output grounding, which essentially means it can point back to the image and produce the coordinates of the objects it describes, rather than only talking about them. The paper also covers data construction, including GPT-generated training data and robustness measures, along with a quantitative evaluation of referring and grounding in chat. This is interesting because, in this section of the paper, they didn't actually compare it to GPT-4 with vision; they compared it to GPT4RoI. Later on in the paper, though, they do compare it to GPT-4 with vision, and I'll show you that too.
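As an aside, here is a tiny, hypothetical illustration of what those three referring input types could look like as data. The Region layout and the normalized-coordinate convention are assumptions for demonstration, not the prompt format from the paper.

```python
# Illustrative only: three ways a referred region can be expressed.
from dataclasses import dataclass

@dataclass
class Region:
    kind: str     # "point", "box", or "free_form"
    coords: list  # coordinates normalized to [0, 1] of image width/height

point = Region("point", [0.20, 0.85])            # a single (x, y)
box = Region("box", [0.05, 0.70, 0.35, 0.98])    # (x1, y1, x2, y2)
free_form = Region("free_form",                  # polygon outline
                   [[0.10, 0.75], [0.22, 0.80], [0.30, 0.95], [0.08, 0.97]])

for r in (point, box, free_form):
    print(f"{r.kind}: {r.coords}")
```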
If we take a look at GPT4RoI, its paper is titled "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest." Essentially, GPT4RoI is a large language model that was specifically instruction-tuned on regions of interest in images. In the benchmarks of the PDF, I'm guessing the researchers tested against GPT4RoI instead of GPT-4 Vision because GPT4RoI is specifically designed for understanding and interacting with regions of interest in images, which is a more specialized task than what GPT-4 Vision is designed for. GPT4RoI's ability to combine language with detailed image analysis, especially its focus on specific areas within images, makes it a more suitable baseline for testing the Ferret model's capabilities in fine-grained multimodal understanding and interaction. This comparison helps highlight the advancement and specific strengths of the Ferret model in handling complex vision tasks.
In the paper, they do note that, on the other hand, GPT-4 Vision is more knowledgeable in common sense; for example, it can further point out that an exhaust pipe can reduce noise, and GPT-4's linguistic capabilities are much more advanced. In regard to grounding, Ferret excels at identifying most traffic lights even in cluttered scenes. Ferret shines especially when precise bounding boxes are needed for grounding, catering to applications that require pinpoint accuracy in small regions.
If we compare GPT-4 Vision to Apple's new multimodal Ferret model, it's clear that Ferret excels at accurately identifying small and specific regions in images, particularly in complex scenarios. GPT-4 Vision can recognize areas outlined in red or specified in text, but it tends to struggle with smaller regions. Whereas GPT-4 Vision is knowledgeable and effective at general-knowledge question answering about image regions, Ferret stands out for its precision in pinpointing small areas, filling a crucial gap in detailed image analysis.
Furthermore, Apple has been actively acquiring a range of artificial intelligence companies in recent years, with the aim of enhancing the AI and machine learning capabilities of its products and services. These acquisitions have allowed Apple to tap into the expertise and technology of these companies to develop advanced AI and machine learning features across a range of applications. One such feature is the rumored Apple GPT, a language model similar to GPT-3 that aims to enhance Siri's virtual-assistant capabilities and other AI-powered features on Apple's products. With a heavy focus on machine learning, Apple is committed to staying ahead of the curve in the technology industry, driving innovation and pushing the boundaries of what's possible with this technology.
In conclusion, Apple has made significant strides in machine learning and AI in recent years, with the introduction of the Ferret model showcasing its commitment to pushing the boundaries of what's possible in AI. With the rumored Apple GPT on the horizon, it's clear that Apple is not resting on its laurels but is continuously striving to be at the forefront of technology and innovation in the AI space. Exciting times lie ahead for Apple and AI enthusiasts as we anticipate the groundbreaking developments that will continue to shape the future of AI.