15 hours ago • code_your_own_AI

Hi community,
you asked how much it costs to fine-tune (or maybe align) a bigger LLM, like Llama-3-70B.

@abhi1thakur tweeted that on a single node (8x H100 GPUs) with Hugging Face's AutoTrain it took about 2.5 hours, with this YAML file ( https://github.com/huggingface/autotrain-advanced/blob/main/configs/llm_finetuning/llama3-70b-orpo-v1.yml ), and cost about US$ 200 when run on NVIDIA DGX Cloud. Just PEFT, no 4-bit quantization.

How to install autotrain:  https://github.com/huggingface/autotrain-advanced 

See also  https://twitter.com/abhi1thakur/status/1786680791348506855 

Detailed instructions for Autotrain here:  https://huggingface.co/blog/train-dgx-cloud 

22 hours ago • code_your_own_AI

Hi community,
if you are a green grasshopper and in Vienna next week for ICLR 2024,
there is a unique opportunity to ask great mentors personally for their advice during the "mentoring chats", held daily between 12:45 pm and 2:15 pm.

Stellar line-up of mentors: Aditi Raghunathan, Aleksandra Faust, Alexander Rush, Amy Zhang, Andrea Bajcsy, Andrej Risteski, Andrew Gordon Wilson, Anna Rumshisky, Claire Vernade, Danqi Chen, Erin Grant, Eunsol Choi, Furong Huang, Hanwang Zhang, Kyunghyun Cho, Lili Mou, Luke Zettlemoyer, Masashi Sugiyama, Mihaela van der Schaar, Moritz Hardt, Piotr Koniusz, Priya L. Donti, Rene Vidal, Rosanne Liu, Samy Bengio, Tatsunori Hashimoto, Xuezhi Wang, Yao Qin, and Yoshua Bengio.

For more info see  https://blog.iclr.cc/2024/05/01/hugging-face-demo-site-2/ 

Smile ... if you want to know what kind of questions to expect:
---------------------------------------------------------------------------------

Should I establish myself as an expert in one area/technique or explore a breadth of topics? Should I master a technique and apply it to different problems, or should I master a subfield by finding all useful techniques?

How to get a good balance between collaborating with other researchers while also distinguishing my own research? Will too much collaboration hurt my job prospects?

Should I present my work in poster sessions and workshops? Should I be scared of getting scooped? What are the pros of presenting my work early?

How should I decide whether to go to academia or industry?

Does it make sense to do a software engineering/development internship if it is not research-related?

Should I do a research internship on a topic different from my dissertation? 

1 day ago • code_your_own_AI

Hi community,
new functional elements discovered in transformers: retrieval heads. 

New study by MIT and Peking Univ.
Retrieval heads impact RAG performance, CoT reasoning, and long-context retrieval quality.



These special attention heads perform the information retrieval over long contexts and show exceptional characteristics. Applications: new approaches for improved RAG systems, better retrieval over long context windows, improved causal reasoning, better Chain-of-Thought (CoT) reasoning, significantly reduced factual hallucination of RAG systems and LLMs ... and more. 
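
For readers who want to poke at this themselves, here is a minimal sketch of the kind of "retrieval score" the paper's needle-in-a-haystack probing is built on: a head counts as a retrieval head if, while the model copies a needle token, its strongest attention points at exactly that token in the context. The function name, shapes, and the toy data below are my own illustration, not the authors' code.
----------------------------------
import torch

def retrieval_scores(attentions, needle_positions, needle_token_ids, generated_token_ids):
    """Per-head retrieval score: fraction of copied tokens for which the head's
    top-attended context position holds exactly the token being generated.
    attentions: list over layers of [num_heads, num_generated, context_len] tensors."""
    pos_to_tok = dict(zip(needle_positions.tolist(), needle_token_ids.tolist()))
    scores = []
    for layer_attn in attentions:                      # [H, T_gen, T_ctx]
        top_pos = layer_attn.argmax(dim=-1)            # strongest-attended position per head/step
        layer_scores = []
        for h in range(layer_attn.shape[0]):
            hits = sum(
                1
                for t, tok in enumerate(generated_token_ids.tolist())
                if pos_to_tok.get(top_pos[h, t].item()) == tok
            )
            layer_scores.append(hits / max(len(generated_token_ids), 1))
        scores.append(layer_scores)
    return torch.tensor(scores)                        # [num_layers, num_heads]

# Toy usage with random attention, just to show the shapes involved.
L, H, T_gen, T_ctx = 2, 4, 5, 50
attn = [torch.rand(H, T_gen, T_ctx).softmax(dim=-1) for _ in range(L)]
needle_pos = torch.arange(10, 15)                      # where the needle sits in the context
needle_ids = torch.tensor([101, 102, 103, 104, 105])   # its token ids
gen_ids = torch.tensor([101, 102, 103, 104, 105])      # the model copies the needle verbatim
print(retrieval_scores(attn, needle_pos, needle_ids, gen_ids))  # high-scoring heads ≈ retrieval heads
----------------------------------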

What are you interested in: 

Interested in new functional elements within transformers

How to improve Information retrieval (IR) systems

I finally want an agent with better RAG performance!

How to stop factual hallucination of my LLM

183 votes

2 days ago • code_your_own_AI

Hi community,
New LLM leaderboard published (May 1, 2024) with REKA-Flash-21B-online at position 13 (congratulations).
Also at position 13: Mistral Large and Llama 3 8B Instruct (congrats!). 

One subscriber asked: where is the new Snowflake Arctic Instruct LLM? Why is a brand-new LLM with 480B parameters (!) not in the Top 3? 

... by the way: Top 1 is GPT-4-Turbo (as always), and tied at Top 2: Claude 3 Opus and Gemini 1.5 Pro (smile)

Snowflake Arctic Instruct, 480B (MoE, 128x3.66B), with more than 28,600 votes, officially sits at position 37 (behind the MoE Mixtral-8x7B-Instruct at position 31) in the latest benchmark data from the AI community, as published on the LMSYS.org leaderboard on May 1. 

See attached images from LMSYS.org. 

3 days ago • code_your_own_AI

Hi community,
Asked what the mysterious gpt2-chatbot is, I can only speculate: it would be clever to hide the new Gemini Ultra under a naming convention borrowed from your competitor. If it fails, it doesn't reflect on you; if it excels, you now know that people are going to love it, and you can take it offline once you have gathered enough data, especially when your public image is a little bit .... 

Or you could watch my latest video and compare the three answers by gpt2-chatbot with those of the other tested models, from GPT-4 to Gemini 1.5 Pro. 

Hope I have contributed in a non-boring way to ongoing gpt2-chatbot speculations and hype. Smile. 

4 days ago (edited) • code_your_own_AI

Hi community,
new video - new blind test of new LLMs - new stealth gpt2-chatbot tested!

While doing my regular blind performance tests on LMSYS.org, a new LLM showed up to answer my new, updated causal reasoning test, which now tests for the "most likely causal reasoning patterns". A harder test for new LLMs, and definitely a non-standard benchmark. 

You will find the gpt2-chatbot results and the new test itself in my latest video, published here:
>>> NEW LLM Test: Reasoning of gpt2-chatbot
 https://youtu.be/oRnYBp_5X-o 

5 days ago • code_your_own_AI

Hi community,
Part 3: Now let's have fun with MoE. 

Does it make sense to give the gate master (the gating network) the authority to dissect complex information into small, non-complex pieces? Let me explain: now we build an MoE of MoEs. In a hierarchical MoE system, each expert in the primary MoE layer is replaced by another MoE system. This creates layers of specialization where each "sub-expert" can be highly specialized in a very narrow aspect of the problem space that the primary expert is responsible for. This hierarchy allows for an even more granular approach to problem-solving, potentially increasing the system's ability to handle extremely complex and varied tasks.

Each MoE in the hierarchy would have its own gating network. This network would need to be trained to not only route inputs correctly at its level but also to interact effectively with both the layer above (if any) and the experts it controls. This adds layers of complexity in the routing logic, potentially increasing the difficulty of training these networks to make effective decisions.

While an MoE-of-MoEs architecture can theoretically handle extremely complex problems by leveraging distributed computing, it also multiplies the computational overhead due to the multiple gating mechanisms and the interaction between different levels of experts. 
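
To make the idea concrete, here is a minimal PyTorch sketch of an MoE of MoEs, where every expert of the primary layer is itself a small MoE of FFN sub-experts. It uses dense softmax routing at both levels for readability (real systems use sparse top-k routing, load balancing, etc.); the class name and sizes are my own illustration.
----------------------------------
import torch
import torch.nn as nn

class MoE(nn.Module):
    """One MoE level: a gating network plus a list of experts, combined by softmax weights."""
    def __init__(self, dim, num_experts, make_expert):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)                    # gating network of this level
        self.experts = nn.ModuleList([make_expert() for _ in range(num_experts)])

    def forward(self, x):                                          # x: [batch, dim]
        weights = self.gate(x).softmax(dim=-1)                     # [batch, num_experts]
        outs = torch.stack([e(x) for e in self.experts], dim=1)    # [batch, num_experts, dim]
        return (weights.unsqueeze(-1) * outs).sum(dim=1)           # weighted mixture

dim = 64
ffn = lambda: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

# Primary MoE whose "experts" are themselves small MoEs of FFN sub-experts.
hierarchical = MoE(dim, num_experts=4, make_expert=lambda: MoE(dim, 4, ffn))
print(hierarchical(torch.randn(2, dim)).shape)                     # torch.Size([2, 64])
----------------------------------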

Q single: Shall we build it? Argue why. 

7 days ago • code_your_own_AI

Hi community,
Part 2: Mixture of Experts (MoE) is a beautiful and complex topic. Let us focus now on the Complexity of Integration: managing the outputs of various specialized experts and integrating them into a coherent overall output can be challenging, especially when different experts have conflicting interpretations or analyses of the data.

During training, as the gating network routes different kinds of inputs to specific experts, those experts receive a non-uniform sample of the overall data. This selective exposure leads experts to develop specialized capabilities tailored to the features of their respective data subsets.

Now the backpropagation of errors in training not only adjusts the weights of the expert networks but also tunes the gating mechanism to optimize expert selection. This dynamic adjustment helps fine-tune the specialization process, making the gating decision more aligned with the input features over time. 
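
As a tiny illustration of that last point (my own toy example, not a production recipe): because the gate's softmax weights sit inside the mixture output, a single backward pass produces non-zero gradients for both the experts and the gating network, so routing and specialization are tuned together.
----------------------------------
import torch
import torch.nn as nn

dim, num_experts = 16, 4
gate = nn.Linear(dim, num_experts)                         # gating network
experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

x = torch.randn(8, dim)
target = torch.randn(8, dim)

w = gate(x).softmax(dim=-1)                                # routing probabilities per input
y = sum(w[:, i:i + 1] * experts[i](x) for i in range(num_experts))
loss = nn.functional.mse_loss(y, target)
loss.backward()

print(gate.weight.grad.abs().mean())        # non-zero: the routing itself gets tuned
print(experts[0].weight.grad.abs().mean())  # non-zero: the experts specialize on their share of the data
----------------------------------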

But what happens if the outputs from our experts (in an MoE) are not coherent, and maybe even contradict each other?

What YOU think:
----------------------

Q1: Is this a similar scenario to when a RAG system retrieves and injects particular information from external sources (or an agent) that absolutely contradicts the internal parametric knowledge the LLM acquired during pre-training? Discuss the potential impacts on system robustness and data integrity in hybrid AI systems.

Q2: Different experts may produce outputs that vary significantly due to their specialized training; this divergence can manifest as different predictions, confidence levels, or even different types of outputs, depending on each expert's design. Some MoE architectures include an additional neural network layer or algorithmic step specifically designed to resolve conflicts among experts. This layer analyzes the experts' outputs and their associated confidence levels to adjudicate disagreements. Propose a detailed architecture for a conflict-resolution layer in a hybrid MoE-RAG system that addresses expert disagreements. 

Q3: With synthetic data augmentation, if the AI can itself determine the nature of the conflict, it can design targeted data augmentation or retraining with synthetically curated datasets that specifically address the conflicting scenarios, without human intervention. Would you trust this self-correcting AI MoE system? Like ... in medical diagnostics?

Q4: If we have 128 experts in a single MoE, calculate the probability that any 2 experts generate a conflict. Include a progressive complexity function in relation to the model size. Use no specific data, just some mathematical ideas on how to solve this.

Looking forward to your insightful ideas in the comments. 

7 days ago (edited) • code_your_own_AI

Hi community,
yes, you can run all the new OpenELM models from Apple on MLX (!) with only two lines of code (M1 to M3):
----------------------------------
Step 1: pip install mlx-lm
Step 2: mlx_lm.generate --model mlx-community/OpenELM-270M-Instruct --prompt "Explain the dynamics of Mixture of Experts models with Transformers" --max-tokens 512 --temp=0.0

----------------------------------
Pre-converted models (already converted to MLX format) are available in the MLX Hugging Face community: 
 https://huggingface.co/mlx-community/ 
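
If you prefer Python over the CLI, the same pip package also exposes load/generate helpers; a minimal sketch, assuming the mlx_lm Python API and the pre-converted mlx-community checkpoint named in the command above:
----------------------------------
from mlx_lm import load, generate

# Downloads the MLX-converted weights from the mlx-community repo on first use.
model, tokenizer = load("mlx-community/OpenELM-270M-Instruct")

text = generate(
    model,
    tokenizer,
    prompt="Explain the dynamics of Mixture of Experts models with Transformers",
    max_tokens=512,
)
print(text)
----------------------------------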

Special thanks to: Awni Hannun  (Apple)



 We introduce OpenELM, a family of open source Efficient Language Models. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. We pretrained OpenELM models using the CoreNet library. We release both pretrained and instruction tuned models with 270M, 450M, 1.1B and 3B parameters.

Our pre-training dataset contains RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens. Please check license agreements and terms of these datasets before using them.



as always:  https://huggingface.co/apple/OpenELM 

8 days ago (edited) • code_your_own_AI

Hi community,

MoEs involve replacing dense feed-forward network (FFN) layers in a transformer model with sparse MoE layers. Each MoE layer consists of multiple "experts" (typically neural networks themselves, often FFNs), and a gating or routing mechanism that directs input tokens to the most appropriate expert based on learned criteria. This setup allows the model to utilize a subset of its total capacity at any given time, enhancing computational efficiency and allowing the model to scale significantly in terms of parameter count without a proportional increase in computation during inference.

Theoretically, the right combination of experts can enhance performance beyond what each could achieve individually. However, each expert’s effectiveness is still bounded by its individual capacity—smaller experts may not adequately capture the complexities needed for high-level reasoning. Let's explore this thought ....
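
To ground the questions below, here is a hedged PyTorch sketch of the kind of sparse top-k MoE layer described above, i.e. a drop-in replacement for the transformer's FFN block. Load-balancing losses, capacity limits, and all the other engineering details are omitted; the class and parameter names are my own.
----------------------------------
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Replaces a dense FFN: each token is routed to its top-k experts only."""
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)               # gating: token -> expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                       # x: [tokens, dim]
        logits = self.router(x)
        top_w, top_idx = logits.topk(self.top_k, dim=-1)        # keep only k experts per token
        top_w = top_w.softmax(dim=-1)                           # renormalize their weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                       # tokens sent to expert e at rank k
                if mask.any():
                    out[mask] += top_w[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(32, 64)                                    # 32 tokens, model dimension 64
print(SparseMoE(dim=64)(tokens).shape)                          # torch.Size([32, 64])
----------------------------------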

>>> Now three simple questions (Q) on current MoE. What do you think?

Q1: How does the individual capacity (size) of an expert in an MoE system affect its ability to learn complex tasks? Discuss whether having many smaller experts specialized in narrow domains could be more or less effective than having fewer, larger experts with broader capabilities.

Q2: Considering an MoE system where each expert is relatively small, what challenges might arise in integrating these experts to handle complex logical reasoning tasks? How important is the role of the gating mechanism in maximizing the synergy among experts?

Q3: Evaluate the trade-offs between increasing the number of experts versus expanding the size of each expert in an MoE system. How might these choices impact the system’s ability to perform complex logical reasoning and generalize across diverse tasks?

Please leave your thoughts in a comment, for the community to evaluate. Smile.