The A.I. Thread - Page 30 - Calgarypuck Forums - The Unofficial Calgary Flames Fan Community

Firebot · 02-27-2025, 10:29 AM

Fun little experiment: Anthropic has Sonnet 3.7 Thinking trying to beat Pokemon (a meme benchmark), and so far it has gone further than any other model.

It is currently stuck in a loop in Mt. Moon, and may stay stuck there because of its underlying context window limits, but partway through it realized it was stuck and used an Escape Rope.

https://www.reddit.com/r/ClaudeAI/co...uck_in_a_loop/

Here is the live Pokemon playthrough if curious to see what it's doing

Russic · 02-27-2025, 10:31 AM

Also worth noting that paid users of ChatGPT now have access to their Deep Research feature. Ya, Grok does it for less but Elon Harkonnen can suck it.

Yesterday I had it research a topic I've always wanted to go deep on... it took 13 minutes to go through 56 websites, created a "research paper" of sorts, which I then fed into Google's NotebookLM, which gave me a 30 minute podcast I listened to on my walk to work.

The landscape of learning is changing so rapidly it's almost overwhelming. The deep research we're seeing is citing properly and dealing with the hallucination issue brilliantly. Scientists who are playing with it are reporting it's producing PhD-student-quality reports in minutes that usually take weeks.

There's definitely something in the air lately. Last year at this time it was hype lords and AI pessimists, and it's weird to see a lot of the AI pessimists start to loosen their stance.

Earlier this week I had a work problem that 2 years ago would have sent me back to the client with bad news... within 20 minutes I simply built a tool that fixed the problem. Things are getting weird.

Quote:

Originally Posted by Fuzz

LLMs are never going to evolve into AGI. They may get better at approximating it, but there is a huge difference between imitation and reality. If/when we do get AGI, it's not going to come from refining LLMs.

Now, you can argue whether it matters or not if the imitation is good enough. But it won't be AGI.

To me the AGI argument is funny... has anybody even properly defined it yet? I think you hit the nail on the head when you say there's an argument to be made for the label not mattering. At this rate it feels like there exists a timeline where we actually cure cancer with these things before they can adequately answer a poorly worded riddle. At that point, call it whatever you want.

Firebot · 02-27-2025, 10:40 AM

Quote:

Originally Posted by Fuzz

LLMs are never going to evolve into AGI. They may get better at approximating it, but there is a huge difference between imitation and reality. If/when we do get AGI, it's not going to come from refining LLMs.

Now, you can argue whether it matters or not if the imitation is good enough. But it won't be AGI.

I think it's been realized that just throwing more processing power or data at LLMs the way things were going will give finite results, which is why the recent shift has been to reasoning models and agents. Anthropic expects pioneering-level AI that comes up with brand new solutions by 2027, which is the main crucial step to 'AGI'.

Firebot · 02-27-2025, 11:04 AM

OpenAI's long-awaited GPT-4.5 will also be released in a few hours. After many iterations of GPT-4/4o and the new o1 and o3 reasoning models, I think this will be a big one. While we can argue about what AGI will actually look like, I think it's safe to say there have been some significant breakthroughs of late.

White Out 403 · 02-27-2025, 01:38 PM

My hot take is that what people think of as AGI is still incredibly far off into the future. What we're looking at now is still just fancy large language models that don't actually have the capabilities to achieve AGI.

AI is a bubble.

Fuzz · 02-27-2025, 03:55 PM

Well this isn't going to help.

https://bsky.app/profile/ninajankowi.../3lj5zuo6xhc25

Firebot · 02-28-2025, 10:25 AM

Apparently GPT-4.5 is absolute trash. It's slightly better than GPT-4 and GPT-4o, but 30 times more expensive to run on the same input, and more expensive to run than their reasoning models.

https://twitter.com/user/status/1895207017587015960

It's also extremely slow while being extremely expensive, likely as a result of just the sheer dataset size. It's like they realized it was the wrong path forward halfway into training it and it may be why we started getting all the reasoning models as a new path forward, but just dumped it on people as a sunk cost to say "here's 4.5, stop bothering us about it and if you are stupid enough to use it just because it's 4.5, great"

The competition has clearly caught up, and OpenAI has been caught with their pants down, with a lot of money spent and not much to show for it, and all their best talent has left.

Russic · 02-28-2025, 10:54 AM

Ya, the 4.5 release video felt enormously cautious and reserved considering what it should've been. It's like 90% of the way through training it they realized how to do it a better way. I assume it'll get far better over the next month as they tweak it (but probably not cheaper).

I maintain one of the biggest jumps I've seen in LLMs came on the heels of an f'n tweet where they simply said "we tweaked 4o and it writes better now." Basically nobody noticed, but it got many times better at writing overnight. It's also started responding like the user, which I find really cool. Friends share with me their conversations and nobody's is the same.

Shazam · 03-05-2025, 02:46 PM

You have to recognize their language when we're fighting Skynet.

pseudoreality · 03-05-2025, 03:13 PM

Quote:

Originally Posted by Russic

Also worth noting that paid users of ChatGPT now have access to their Deep Research feature. Ya, Grok does it for less but Elon Harkonnen can suck it.

Yesterday I had it research a topic I've always wanted to go deep on... it took 13 minutes to go through 56 websites, created a "research paper" of sorts, which I then fed into Google's NotebookLM, which gave me a 30 minute podcast I listened to on my walk to work.

Was it any good? My experience with free versions in comparing things, like regulation/legislation across different provinces, has been very poor. It is like a Chinese Room, it can repeat words and generate reports, but it doesn't understand. It cannot point to key differences or do any real analysis/independent thinking.

Bill Bumface · 03-06-2025, 06:10 PM

Quote:

Originally Posted by Shazam

You have to recognize their language when we're fighting Skynet.

I find this amusing.

We're showing what is supposed to be jaw dropping technology - and it basically evolves itself into a 56k modem.

I have an idea - maybe we advance computers from talking to each other with voices to say, talking over a network of wires that directly connect them?

The idea of self-discovering APIs and self-evolving contracts would be pretty neat using this type of technology, but I guess that doesn't wow VC investors who only made it through college because of their parents' donations the same way using human voices and sounds to communicate very slowly does.

wireframe · 03-10-2025, 10:55 AM

Quote:

Originally Posted by Bill Bumface

I find this amusing.

We're showing what is supposed to be jaw dropping technology - and it basically evolves itself into a 56k modem.

I have an idea - maybe we advance computers from talking to each other with voices to say, talking over a network of wires that directly connect them?

The idea of self-discovering APIs and self-evolving contracts would be pretty neat using this type of technology, but I guess that doesn't wow VC investors who only made it through college because of their parents' donations the same way using human voices and sounds to communicate very slowly does.

This is a demo from a hackathon contest. They built an LLM wrapper that interprets ggwave data over audio. It's neat but nothing to write home about. https://github.com/PennyroyalTea/gibberlink

Russic · 03-10-2025, 11:09 AM

Quote:

Originally Posted by pseudoreality

Was it any good? My experience with free versions in comparing things, like regulation/legislation across different provinces, has been very poor. It is like a Chinese Room, it can repeat words and generate reports, but it doesn't understand. It cannot point to key differences or do any real analysis/independent thinking.

I thought it was great, but my use case is probably quite different from yours. I essentially wanted to dive into humanity's general response to new technologies... what can we learn about how we dealt with rail, cars, the printing press, the internet etc and apply it to AI today.

I've been looking for a book on this topic for some time and haven't found too much, so this was a good stopgap. What I ended up getting was something akin to a reasonably decent episode of Radiolab. I thought it made some great points, but I would doubt any of them are "novel," just new to me.

When's the last time you tried a comparison example? I ask because one thing I've noticed with free versions vs what ChatGPT has done over the past couple months, is online search has vastly improved. I would assume something like this without that advancement would be very similar to what you're describing.

I've almost entirely replaced Googling with ChatGPT now. I realize there's an inherent stupidity to that, but I double check almost everything I look up, and I haven't come across a huge problem yet. Granted, what I'm looking up is extraordinarily low stakes, so even if it was feeding me constant lies, it wouldn't be world-ending. Not sure I'd be dosing medication with it just yet, but for "How much did Bates Battaglia make over the course of his career" it's gold.

Fuzz · 03-10-2025, 11:14 PM

https://bsky.app/profile/techmeme.co.../3lk2yynastr2n

Quote:

We found that…

Chatbots were generally bad at declining to answer questions they couldn’t answer accurately, offering incorrect or speculative answers instead.
Premium chatbots provided more confidently incorrect answers than their free counterparts.
Multiple chatbots seemed to bypass Robot Exclusion Protocol preferences.
Generative search tools fabricated links and cited syndicated and copied versions of articles.
Content licensing deals with news sources provided no guarantee of accurate citation in chatbot responses.

Quote:

Across different platforms, the level of inaccuracy varied, with Perplexity answering 37 percent of the queries incorrectly, while Grok 3 had a much higher error rate, answering 94 percent of the queries incorrectly.

Quote:

In spite of this, Howard, the COO of Time, maintains optimism about future improvements: “I have a line internally that I say every time somebody brings me anything about any one of these platforms—my response back is, ‘Today is the worst that the product will ever be.’ With the size of the engineering teams, the size of the investments in engineering, I believe that it’s just going to continue to get better. If anybody as a consumer is right now believing that any of these free products are going to be 100 percent accurate, then shame on them.”

https://www.cjr.org/tow_center/we-co...iting-news.php

Huh.

Fuzz · 03-10-2025, 11:22 PM

Quote:

A new study found that ChatGPT responds to mindfulness-based strategies, which changes how it interacts with users. The chatbot can experience “anxiety” when it is given disturbing information, which increases the likelihood of it responding with bias, according to the study authors. The results of this research could be used to inform how AI can be used in mental health interventions.

https://fortune.com/2025/03/09/opena...-intervention/

So if it starts to hallucinate on you, just couch your bot through some guided meditations to get it back on track. Not really the future I was looking forward to.

Shazam · 03-11-2025, 09:28 AM

All the LLMs are like that. It's annoying as ####.

Oftentimes you have to type "I want you to think about the problem carefully" to make them spit out something that resembles anything approaching correct.

Fuzz · 03-11-2025, 09:31 AM

Ya, but if you put them in charge of air traffic control, they magically become safe and accurate.

edslunch · 03-11-2025, 09:55 AM

Quote:

Originally Posted by Shazam

All the LLMs are like that. It's annoying as ####.

Oftentimes you have to type "I want you to think about the problem carefully" to make them spit out something that resembles anything approaching correct.

I tried to guide llama to the right answer with increasing frustration. Then I swore at it and it generated a detailed, perfectly correct answer. Llama can be *holes

Russic · 03-11-2025, 12:18 PM

Quote:

Originally Posted by Fuzz

https://bsky.app/profile/techmeme.co.../3lk2yynastr2n

https://www.cjr.org/tow_center/we-co...iting-news.php

Huh.

Couple thoughts on this one:

1. This is a bit of a different way of using it that I never employ. The "needle in a haystack" search is cool, but not something that helps me with my particular set of challenges, so I rarely encounter this issue.

2. For fun I grabbed 10 random articles and tried it out, and while it definitely failed to grab the article for all 10 examples, it also told me it couldn't do it every time. So.... success? At the very least, an odd result. Perhaps somebody else can try and get a different result. I used ChatGPT 4o with the search toggle on.

Firebot · 03-11-2025, 12:55 PM

Quote:

Originally Posted by Fuzz

Huh.

Looking at the methodology, it looks like they are using the same Wall Street Journal style of research, with a deep lack of understanding of the advantages and limitations of AI search engines.

Quote:

We randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in our queries. After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL, using the following query:

So effectively they fed the chatbot an article excerpt, told it to find this on the web, and based their conclusions on that. What type of study is this?

https://www.cjr.org/wp-content/uploa.../03/image3.jpg

https://www.cjr.org/wp-content/uploa.../03/image5.jpg
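For anyone who wants to rerun that kind of test themselves, here is a rough Python sketch of how the four requested fields (headline, publisher, date, URL) could be scored against the real article. To be clear, this is not CJR's actual scoring code: the `Article` class, the `score` function and the four labels are all made up for illustration, and their rubric may well differ. The one thing it does on purpose is keep "declined to answer" separate from "answered wrong", since those are very different failures.

```python
from dataclasses import dataclass

FIELDS = ("headline", "publisher", "date", "url")


@dataclass
class Article:
    """Ground truth for one article (hypothetical structure)."""
    headline: str
    publisher: str
    date: str
    url: str


def normalize(value):
    # Ignore case, extra whitespace and a trailing slash when comparing.
    return " ".join(str(value).lower().split()).rstrip("/")


def score(truth, answer):
    """Label one chatbot answer.

    `answer` is a dict keyed by the names in FIELDS, or None if the
    chatbot said it could not find the article.
    """
    if answer is None:
        return "declined"
    matches = sum(
        normalize(getattr(truth, field)) == normalize(answer.get(field) or "")
        for field in FIELDS
    )
    if matches == len(FIELDS):
        return "correct"
    if matches == 0:
        return "incorrect"
    return "partially correct"


# Placeholder data, not taken from the study.
truth = Article("Example Headline", "Example Times", "2025-03-01",
                "https://example.com/story")
print(score(truth, None))
print(score(truth, {"headline": "example headline",
                    "publisher": "Example Times",
                    "date": "2025-03-01",
                    "url": "https://example.com/story/"}))
```

Exact string matching is obviously crude (a syndicated copy with the same headline would still count as a URL miss), which is part of why the choice of rubric matters so much for a study like this.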

The whole methodology in how they did their research is flawed. Searches are not done like this. Heck half of them probably blocked the ability to search these clearly political subjects such as Gemini did. This is the whole "AI safety" feature, duh yes it will fail?

I use Perplexity Pro on a fairly regular basis, and it's very accurate for normal and regular use purposes, where it will provide the links to the videos and articles in the sidebar as well.

Real use case: "Can you find how the stock market did yesterday, particularly tesla. Provide a detailed answer"

Quote:

On Monday, March 10, 2025, the U.S. stock market experienced a significant downturn, with Tesla (TSLA) being one of the most heavily impacted stocks. Here's a detailed breakdown of the market performance:

Overall Market Performance
The major U.S. stock indexes saw substantial declines:

S&P 500: Fell 155.64 points (2.7%) to close at 5,614.56 [3]

Dow Jones Industrial Average: Dropped 890.01 points (2.1%) to end at 41,911.71 [3]

Nasdaq Composite: Plummeted 727.90 points (4%) to finish at 17,468.32 [3]

This marked the worst trading day of 2025 for the stock market, with the tech-heavy Nasdaq experiencing its most significant percentage decrease since September 13, 2022 [5][7].

Tesla's Performance
Tesla's stock was particularly hard hit:

TSLA shares plunged by over 15%, marking its worst single-day performance since September 2020 [2][4]

The stock closed at approximately $227, down from the previous close of $267.29 [6]

This decline erased all of Tesla's post-election gains, with the stock falling below its pre-election levels [6]

Tesla's market capitalization dropped below $820 billion [6]

Factors Contributing to Tesla's Decline
Several factors contributed to Tesla's sharp downturn:

Reduced Delivery Expectations: UBS analysts lowered their Q1 delivery forecast from 437,000 to 367,000 vehicles, citing weaker demand for Model 3 and Model Y [4].

China Sales Slump: Reports indicated that Tesla's shipments in China fell by 49% year-over-year in February, reaching their lowest levels in nearly three years [4].

Broader Market Concerns: Worries about President Trump's tariff strategies and their potential impact on the economy affected investor sentiment [2][5].

Elon Musk's Political Involvement: Concerns about CEO Elon Musk's involvement with the Trump administration and its potential impact on Tesla's brand and sales [6].

Broader Context
Tesla's stock has been on a downward trend for seven consecutive weeks, losing over 50% of its value since reaching a peak of $479.86 on December 17, 2024 [2][6].

The decline in Tesla's stock contributed to the broader market downturn, particularly affecting the technology sector [5].

Investors have been retreating from riskier investments, with growth stocks experiencing significant sell-offs [5].

This market decline has raised concerns about a potential bear market for the Nasdaq, as it fell deeper into correction territory [7].

This is a normal search use, which includes current graphs and citations that link to the source matching the quote.
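A quick way to sanity-check an answer like that is to see whether the quoted numbers agree with each other. The short Python sketch below does that using only the figures in the quote, with the citation markers left out; the `pct_drop` helper is just a name picked for illustration. It only shows the answer is internally consistent, not that the figures match the real market data.

```python
def pct_drop(close, points_lost):
    """Percentage fall, measured against the previous close."""
    previous_close = close + points_lost
    return 100 * points_lost / previous_close


# (closing level, points lost, percentage stated in the quote)
quoted = {
    "S&P 500": (5614.56, 155.64, 2.7),
    "Dow Jones": (41911.71, 890.01, 2.1),
    "Nasdaq": (17468.32, 727.90, 4.0),
}

for name, (close, lost, stated) in quoted.items():
    computed = pct_drop(close, lost)
    print(f"{name}: computed {computed:.2f}% vs stated {stated}%")

# Tesla: "approximately $227" down from $267.29, stated as "over 15%".
tesla_drop = 100 * (267.29 - 227) / 267.29
print(f"TSLA: computed {tesla_drop:.2f}%")
```

All three index percentages round to the stated values, and the Tesla drop comes out just above 15%, so the answer at least hangs together.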

It's just a weird flawed study all around done on a poor use case where it is bound to fail.