We finally have some news from Apple about what it has been developing in machine learning and large language models (LLMs). Apple has introduced a multimodal AI system that is pretty impressive, because it actually exceeds GPT-4's capabilities in some regards. This might be the scenario many have had in mind when they say that GPT-4 is no longer the king. Let's take a look at exactly what Apple has introduced and how good this new multimodal AI system really is.

Let's start with how the system works. It's called Ferret, and it comes from a paper by Apple researchers. Essentially, it's mainly a vision model. First, it uses an image encoder called CLIP-ViT-L/14 to understand what's in the picture and turn it into a form the computer can work with. Second, it takes the words you give it and converts them into a format it can understand. Then it identifies areas in the image: if you refer to a specific part of the picture, like a cat in the bottom left-hand corner, the model uses coordinates to find exactly where that is in the image. It also handles regions of different shapes, not just simple boxes: it looks at many points in the area you're talking about and captures the details and location of each point. Finally, it brings all of this information together to accurately find and describe the specific part of the picture you're talking about.
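To make the "coordinates" step a bit more concrete, here is a minimal sketch of how a referred region could be written into the text prompt. It is not Apple's code: the function names, the 1000-bin quantization, and the exact bracket format are assumptions made for illustration. Only the `[region0]` placeholder style comes from the paper's examples.

```python
def quantize_box(box, width, height, bins=1000):
    """Map a pixel-space box (x0, y0, x1, y1) to integer bins in [0, bins - 1].

    The bin count is an assumption for illustration, not a value from the paper.
    """
    x0, y0, x1, y1 = box

    def to_bin(value, size):
        # Clamp so a coordinate on the far edge still lands in the last bin.
        return min(bins - 1, max(0, int(value / size * bins)))

    return (to_bin(x0, width), to_bin(y0, height),
            to_bin(x1, width), to_bin(y1, height))


def format_region_prompt(question, regions, width, height):
    """Replace [region0], [region1], ... with quantized box coordinates."""
    for index, box in enumerate(regions):
        coords = quantize_box(box, width, height)
        coord_text = "[" + ", ".join(str(c) for c in coords) + "]"
        question = question.replace(f"[region{index}]", coord_text)
    return question


prompt = format_region_prompt(
    "What is the purpose of the object [region0] on the bike?",
    regions=[(320, 240, 480, 360)],  # hypothetical box in a 640x480 image
    width=640,
    height=480,
)
print(prompt)
```

Quantizing to a fixed number of bins keeps the coordinates independent of the image resolution, which is one plausible reason to express them this way in text.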

Essentially, what we have here is an advanced image understanding model that, on certain benchmarks, exceeds GPT-4's vision capabilities. First, there is a comparison table you may want to look at. It shows that Ferret supports all of the input types: point, box, and free-form. It also supports output grounding, which essentially means that it can tie the objects it mentions in its answer back to their exact locations in the image. The remaining columns cover data construction (including GPT-generated data and robustness) and the quantitative evaluation of refer/ground with chat. What's interesting is that, in this section of the paper, they didn't compare it to GPT-4 with vision. They compared it to GPT4RoI. Later on, I will show you how it compares to GPT-4 with vision.
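One way to picture the three input types (point, box, and free-form) is that each can be reduced to the same thing: a set of pixels, from which a handful of points is then sampled. The sketch below is only an illustration under that assumption. The function names, the disc radius used for a point, and the uniform random sampling are all hypothetical and are not taken from Ferret's implementation.

```python
import random


def point_mask(height, width, cx, cy, radius=2):
    """A point becomes a small disc of pixels around (cx, cy)."""
    return {
        (x, y)
        for y in range(max(0, cy - radius), min(height, cy + radius + 1))
        for x in range(max(0, cx - radius), min(width, cx + radius + 1))
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    }


def box_mask(height, width, x0, y0, x1, y1):
    """A box becomes every pixel inside the rectangle [x0, x1) x [y0, y1)."""
    return {
        (x, y)
        for y in range(max(0, y0), min(height, y1))
        for x in range(max(0, x0), min(width, x1))
    }


def freeform_mask(height, width, pixels):
    """A free-form region (scribble, filled outline) is simply its pixel set."""
    return {(x, y) for x, y in pixels if 0 <= x < width and 0 <= y < height}


def sample_points(mask, count, seed=0):
    """Pick up to `count` distinct pixels from inside the region."""
    rng = random.Random(seed)
    ordered = sorted(mask)  # sort first so a given seed is reproducible
    return rng.sample(ordered, min(count, len(ordered)))


region = box_mask(10, 10, 2, 2, 5, 6)
points = sample_points(region, count=4)
print(len(region), points)
```

Treating every input type as a pixel set means the rest of the pipeline does not need to care whether the user clicked, drew a box, or scribbled.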

If we take a look at GPT4RoI, its paper is titled "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest". Essentially, GPT4RoI is a model fine-tuned specifically for this task; despite the name, it is a separate research model and not OpenAI's GPT-4 with vision. I'm guessing the researchers benchmarked against GPT4RoI rather than GPT-4 with vision because GPT4RoI is specifically designed for understanding and interacting with regions of interest in images, which is a more specialized task than what GPT-4 with vision is designed for. Its ability to combine language with detailed image analysis, especially of specific areas within images, makes it a more suitable baseline for testing Ferret's fine-grained multimodal understanding and interaction. This comparison helps to highlight the specific strengths of the Ferret model in handling complex vision tasks.

In the paper, they note that, on the other hand, GPT-4 with vision is more knowledgeable in common sense. For example, it can further point out that the exhaust pipe can reduce noise, which likely reflects GPT-4's stronger linguistic capabilities. In regard to grounding, Ferret excels at identifying most traffic lights even in cluttered scenes. Ferret shines especially when precise bounding boxes for grounding are needed, catering to applications that require pinpoint accuracy in smaller regions.

If we compare GPT-4 with vision to Apple's new multimodal Ferret model, it's clear that Ferret excels at accurately identifying small and specific regions in images, particularly in complex scenes. GPT-4 with vision can recognize areas outlined in red or specified in text, but it tends to struggle with smaller regions. While GPT-4 with vision is knowledgeable and effective at general question answering about image regions, Ferret stands out for its precision in pinpointing small areas, filling a crucial gap in detailed image analysis.
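Why do small regions demand so much more precision? A rough way to see it is intersection over union (IoU), the standard overlap score for bounding boxes. The sketch below is my own illustration rather than anything from the paper, and the 0.5 pass threshold mentioned afterwards is a common convention that I am assuming here, not a number taken from Ferret's evaluation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# The same 5-pixel horizontal error applied to a small and a large box.
small = iou((0, 0, 10, 10), (5, 0, 15, 10))
large = iou((0, 0, 100, 100), (5, 0, 105, 100))
print(f"small region IoU: {small:.3f}")
print(f"large region IoU: {large:.3f}")
```

With these made-up boxes, the same 5-pixel miss leaves the large box above 0.9 but drops the small one to about 0.33, which would fail a 0.5 threshold. That is the sense in which small regions need pinpoint boxes.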

Furthermore, Apple has been actively acquiring a range of artificial intelligence companies in recent years to enhance the AI and machine learning capabilities of its products and services, tapping into their expertise and technology across a range of applications. One rumored result is Apple GPT, a language model similar to GPT-3 that aims to enhance Siri and other AI-powered features on Apple's products. With its heavy focus on machine learning, Apple appears committed to staying ahead of the curve in the technology industry.

In conclusion, Apple has made significant strides in machine learning and AI in recent years, and the Ferret model showcases that commitment. With the rumored Apple GPT on the horizon, it's clear that Apple is not resting on its laurels. Exciting times lie ahead for Apple and AI enthusiasts.

Leah Sirama, a lifelong enthusiast of Artificial Intelligence, has been exploring technology and the digital realm since childhood. Known for his creative thinking, he's dedicated to improving AI experiences for all, making him a respected figure in the field. His passion, curiosity, and creativity drive advancements in the AI world.


  1. When Siri came out it was the best, though bad. I can imagine Apple could make Siri the best again, as implementing it with the Apple apps is easier than if Google has to implement it with Samsung apps and Oppo and so on… Also, everyone uses their own different Android launcher. And there's a gazillion Androids, and most of them have too weak a processor to use AI, whereas all iPhones of the last 3 years have some AI chip.

  2. But what we have to state clearly: Yes, Apple is maybe behind Samsung, and Samsung has things first. But Apple doesn't want to do things first; Apple wants to do things right. I had Samsung and other phones for ages and had so many problems! Now I've switched to Apple and have never had a single problem since. What Apple releases works!

  3. Apple has no business being in AI as far as public interest goes. Apple as a company has attempted and succeeded in a number of instances in monopolizing aspects of its industry. A powerful monopolized AI system controlled by a powerful corporate tech company such as Apple would guarantee a stranglehold on nearly everything we do in our day to day and would be even harder to stop or regulate. I find Apple to be the prime example of corporate fascism within the tech industry. AI needs to be transparent to all, owned by none and taught by humans with loving hearts, smart and healthy minds and that value humanitarian pursuits above all. Not human beings filled with greed to meet their own self interest in a grossly unregulated capitalistic society. If anyone finds AI to be frightening, remember that AI is only the reflection of our own frightening selves that taught and built AI. Everyone needs to accept and see their own flaws and shortcomings and find healthy ways of fixing that. Possibly through therapy that works on mental health and emotional wellbeing. Then at that point we may be worthy of training AI. That needs to happen sooner than later for the sake of us all.

  5. Why does nobody consider Bard? 😂
    It actually understood what the shock absorber is (even though I posted a shitty picture taken from the TV…)
    Apparently Bard vision is pretty good

  6. Why not ask GPT the exact same question that the other language models were asked in reference to the bike shock absorber? You said "highlighted", which can refer to where the picture receives the most light, since the box is not a highlight. The original question did not say "highlighted". The object is enclosed in a yellow box, i.e. [region0] in the same color font as the box in question, which lets you compare the results on par honestly by using the exact same question. So copy and paste using the same font color for [region0]… What is the purpose of the object [region0] on the bike? The same applies to the [region0] and [region1] in red font enclosed in red ovals. Using the same font color and the same text within the same question used on the actual image might have improved results like it did with Ferret.

    Using white font over a white background makes it invisible, which lessens the effect intended for emphasis for the sake of the video. Also, proofreading videos before releasing them, so they are not filled with text errors that differ from what you said, will only strengthen your reputation. Choosing that emphasis method for your entire speech, instead of just the part needing actual emphasis, and filling it with text errors was a poor way to relay the information. You could have just talked and not used the font, and it would have been clean, better and less distracting. "Ladies and gentlemen, Apple has finally to make their entrance into the generative AR space." Everything that followed was a mess.

  7. Why do they have a woman 👠♀️ as ChatGPT, and it's not even a real robot 🤖, just an image they used. I get that it's not the year 3,000, but I think the images should be made into real robots, not just an image.

  8. No doubt Apple is having problems, something people have been talking about since October: "Apple shares fell more than 3% after Barclays downgraded the stock and trimmed its price target, saying weakening iPhone 15 sales were likely a warning sign for iPhone 16 sales and broader hardware projections." – CNBC

  9. I think of all the exciting things AI can do, my main concern is hallucination. Even the smartest person in the room who understands AI to its deepest core realizes this is a very dangerous issue. There must be a real way to mitigate this. Also, how will the various AI models compete against each other? In the end, Evil versus Good. I think the most important part of AI is finding strategic medical applications, treatments and ultimately cures. Aside from this, my other worry is how AI would affect military operations. I'm sure when this is applied to battlefield combat/defense it would be of great concern. Lots to ponder over.

  11. This is a horrible hodgepodge of stitched together previous videos. This channel has always come across as AI generated. However, this video, I would call AI degraded! BTW, I’m a fan of AI, but only when it enhances things. If I were in charge of this channel, I would take this video down immediately. It’s not flattering to the company.

  12. Can you please just proofread the subtitles in your videos? They're sloppy and full of typos. Your videos are helpful but I'm going to unsubscribe because they bug me and make me doubt your info.

  13. To be fair, GPT4 can at least 1-shot that motorcycle suspension question 🙂 Shot 0 it identifies the muffler. Tell it the muffler is below the box and ask it to try again to identify what's inside the box, and it was spot on at that point.

  14. AI/Synthetic/Biologic/Humanoid/Robotoid Clones/?/ & DoD/Sentient World Simulation/?/ & Smart Dust/Motes/Micro-Electromechanical Sensors/Tagging/Tracking system/?/ – duck duck go!

  15. 4:50, it would be nice if we can send the same image/prompt to multiple different Vision GPT models At The Same Time to get different outputs At The Same Time rather than just having 1.

  16. Even my very basic problem solver GPT gave me the answer easily: The highlighted region on the motorcycle is the shock absorber or suspension system. Its primary purpose is to absorb and dampen shock impulses from the road, which helps to ensure that the motorcycle's wheels stay in contact with the road surface for better traction, control, and comfort, providing a smoother ride. The suspension system also protects the motorcycle and the rider from the potential damage and discomfort caused by rough terrain.

    Confidence Score: 100%

  17. Apple, like X and Meta, actually has a data, hardware, and software moat. Given the recent NYT lawsuit, it seems like having access to large, company owned, multi-media data may be what wins the race.

